（转）浅析Kubernetes CRI-526互联

1、概述

我们知道 Kubernetes 提供了一个 CRI 的容器运行时接口，那么这个 CRI 到底是什么呢？这个其实也和 Docker 的发展密切相关的。(说明：所谓接口，就是它定义了一个行为，你只要按我的行为来实现了，那么我就能支持你；)

在 Kubernetes 早期的时候，当时 Docker 实在是太火了，Kubernetes 当然会先选择支持 Docker，而且是通过硬编码的方式直接调用 Docker API，后面随着 Docker 的不断发展以及 Google 的主导，出现了更多容器运行时介绍，Kubernetes 为了支持更多更精简的容器运行时介绍，Google 就和红帽主导推出了 CRI 标准，用于将 Kubernetes 平台和特定的容器运行时（当然主要是为了干掉 Docker）解耦。

CRI（Container Runtime Interface 容器运行时接口）本质上就是 Kubernetes 定义的一组与容器运行时进行交互的接口，所以只要实现了这套接口的容器运行时都可以对接到 Kubernetes 平台上来。不过 Kubernetes 推出 CRI 这套标准的时候还没有现在的统治地位，所以有一些容器运行时介绍可能不会自身就去实现 CRI 接口，于是就有了 shim（垫片），一个 shim 的职责就是作为适配器将各种容器运行时本身的接口适配到 Kubernetes 的 CRI 接口上，其中 dockershim 就是 Kubernetes 对接 Docker 到 CRI 接口上的一个垫片实现。

从上图可以看到，CRI 主要有 gRPC client、gRPC Server 和具体的容器运行时三个组件。其中 Kubelet 作为gRPC 的客户端来调用 CRI 接口；CRI shim 作为 gRPC 服务端负责将 CRI 请求的内容转换为具体的容器运行时 API并响应 CRI 请求，在 Kubelet 和容器运行时之间充当翻译的角色。具体的容器创建逻辑是，Kubernetes 在通过调度指定一个具体的节点运行 Pod，该节点的 Kubelet 在接到 Pod 创建请求后，调用一个叫作 GenericRuntime 的通用组件来发起创建 Pod 的 CRI 请求给 CRI shim；CRI shim 监听一个端口来响应 Kubelet，在收到 CRI 请求后，将其转化为具体的容器运行时指令，并调用相应的容器运行时来创建 Pod。

2、CRI规范

根据概述可知，任何容器运行时想要接入 Kubernetes，都需要实现一个自己的 CRI shim，来实现 CRI 接口规范。(说明：随着 CRI 方案的发展以及其他容器运行时介绍对 CRI 的支持越来越完善，很多容器运行时可能去掉shim这个代理层，直接以插件的形式实现CRI接口，使得调用链更加简洁，比如 containerd 1.1 )

那么 CRI 有哪些接口需要实现呢？

最新的 CRI 定义位于 Kubernetes 源码包 kubernetes/api.proto，主要定义了两类接口 ImageService 和 RuntimeService。

ImageService 主要定义拉取镜像、查看和删除镜像等操作。ImageService 的操作比较简单，就是拉取、删除、查看镜像状态及获取镜像列表这几个操作，下面我们着重介绍下RuntimeService 。

// ImageService defines the public APIs for managing images.
service ImageService {
    // ListImages lists existing images.
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
    // ImageStatus returns the status of the image. If the image is not
    // present, returns a response with ImageStatusResponse.Image set to
    // nil.
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
    // PullImage pulls an image with authentication config.
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    // RemoveImage removes the image.
    // This call is idempotent, and must not return an error if the image has
    // already been removed.
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
    // ImageFSInfo returns information of the filesystem that is used to store images.
    rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}

RuntimeService 定义了容器相关的操作，包括管理容器的生命周期，以及与容器交互的调用（exec/attach/port-forward）等操作。

service RuntimeService {
    // Version returns the runtime name, runtime version, and runtime API version.
    rpc Version(VersionRequest) returns (VersionResponse) {}

    // RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
    // the sandbox is in the ready state on success.
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    // StopPodSandbox stops any running process that is part of the sandbox and
    // reclaims network resources (e.g., IP addresses) allocated to the sandbox.
    // If there are any running containers in the sandbox, they must be forcibly
    // terminated.
    // This call is idempotent, and must not return an error if all relevant
    // resources have already been reclaimed. kubelet will call StopPodSandbox
    // at least once before calling RemovePodSandbox. It will also attempt to
    // reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
    // multiple StopPodSandbox calls are expected.
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    // RemovePodSandbox removes the sandbox. If there are any running containers
    // in the sandbox, they must be forcibly terminated and removed.
    // This call is idempotent, and must not return an error if the sandbox has
    // already been removed.
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
    // PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
    // present, returns an error.
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    // ListPodSandbox returns a list of PodSandboxes.
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    // CreateContainer creates a new container in specified PodSandbox
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    // StartContainer starts the container.
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    // StopContainer stops a running container with a grace period (i.e., timeout).
    // This call is idempotent, and must not return an error if the container has
    // already been stopped.
    // The runtime must forcibly kill the container after the grace period is
    // reached.
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    // RemoveContainer removes the container. If the container is running, the
    // container must be forcibly removed.
    // This call is idempotent, and must not return an error if the container has
    // already been removed.
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    // ListContainers lists all containers by filters.
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    // ContainerStatus returns status of the container. If the container is not
    // present, returns an error.
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    // UpdateContainerResources updates ContainerConfig of the container synchronously.
    // If runtime fails to transactionally update the requested resources, an error is returned.
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
    // ReopenContainerLog asks runtime to reopen the stdout/stderr log file
    // for the container. This is often called after the log file has been
    // rotated. If the container is not running, container runtime can choose
    // to either create a new log file and return nil, or return an error.
    // Once it returns error, new container log file MUST NOT be created.
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

    // ExecSync runs a command in a container synchronously.
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
    // Exec prepares a streaming endpoint to execute a command in the container.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
    // Attach prepares a streaming endpoint to attach to a running container.
    rpc Attach(AttachRequest) returns (AttachResponse) {}
    // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

    // ContainerStats returns stats of the container. If the container does not
    // exist, the call returns an error.
    rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
    // ListContainerStats returns stats of all running containers.
    rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}

    // PodSandboxStats returns stats of the pod sandbox. If the pod sandbox does not
    // exist, the call returns an error.
    rpc PodSandboxStats(PodSandboxStatsRequest) returns (PodSandboxStatsResponse) {}
    // ListPodSandboxStats returns stats of the pod sandboxes matching a filter.
    rpc ListPodSandboxStats(ListPodSandboxStatsRequest) returns (ListPodSandboxStatsResponse) {}

    // UpdateRuntimeConfig updates the runtime configuration based on the given request.
    rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

    // Status returns the status of the runtime.
    rpc Status(StatusRequest) returns (StatusResponse) {}

    // CheckpointContainer checkpoints a container
    rpc CheckpointContainer(CheckpointContainerRequest) returns (CheckpointContainerResponse) {}

    // GetContainerEvents gets container events from the CRI runtime
    rpc  GetContainerEvents(GetEventsRequest) returns (stream ContainerEventResponse) {}

    // ListMetricDescriptors gets the descriptors for the metrics that will be returned in ListPodSandboxMetrics.
    // This list should be static at startup: either the client and server restart together when
    // adding or removing metrics descriptors, or they should not change.
    // Put differently, if ListPodSandboxMetrics references a name that is not described in the initial
    // ListMetricDescriptors call, then the metric will not be broadcasted.
    rpc ListMetricDescriptors(ListMetricDescriptorsRequest) returns (ListMetricDescriptorsResponse) {}

    // ListPodSandboxMetrics gets pod sandbox metrics from CRI Runtime
    rpc ListPodSandboxMetrics(ListPodSandboxMetricsRequest) returns (ListPodSandboxMetricsResponse) {}
}

从接口中可以看出 RuntimeService 除了有 container 的管理接口外，还包含 PodSandbox 相关的管理接口和exec 、attach 等与容器交互的接口。

PodSandbox 这个概念对应的是 Kubernetes 里的 Pod，它描述了 Kubernetes 里的 Pod 与容器运行相关的属性或者信息，如 HostName、CgroupParent 等，设计这个的初衷是因为 Pod 里所有容器的资源和环境信息是共享的，但是不同的容器运行时实现共享的机制不同，如 Docker 中 Pod 会是一个 Linux 命名空间，各容器网络信息的共享通过创建pause 容器的方法来实现，而 Kata Containers 则直接将 Pod 具化为一个轻量级的虚拟机，将这个逻辑抽象为PodSandbox 接口，可以让不同的容器运行时在 Pod 实现上自由发挥，自己解释和实现 Pod 的逻辑。

Exec 、 Attach 和 PortForward 是三个和容器进行数据交互的接口，由于交互数据需要长链接来传输，这些接口被称为 Streaming API 。CRI shim 依赖一套独立的 Streaming Server 机制来实现客户端与容器的交互需求。长连接比较消耗网络资源，为了避免因长连接给 Kubelet 节点带来网络流量瓶颈，CRI 要求容器运行时启动一个对应请求的单独的流服务器，让客户端直接与流服务器进行连同交互。

上图所示， kubectl exec 命令实现过程如下：

客户端发送 kubectl exec 命令给 apiserver；
apiserver 调用 Kubelet 的 Exec API；
Kubelet 调用 CRI 的 Exec 接口（具体的执行者为实现该接口的 CRI Shim ）；
CRI Shim 向 Kubelet 返回 Streaming Server 的地址和端口；
Kubelet 以 redirect 的方式返回给 apiserver
apiserver 通过重定向来向 Streaming Server 发起真正的 /exec/{token} 请求，与它建立长连接，完成 Exec 的请求和响应。

3、Kubelet CRI架构

Kubelet 在引入 CRI 之后，主要的架构如下图所示，其中 Generic Runtime Manager 负责发送容器创建、删除等CRI 请求，Container Runtime Interface(CRI) 负责定义 CRI 接口规范，具体的 CRI 实现可分为两种：Kubelet 内置的 dockershim 和远端的 CRI shim。

其中 dockershim 是 Kubernetes 自己实现的适配 Docker接口的 CRI 接口实现，主要用来将 CRI 请求里的内容组装成 Docker API 请求发给 Docker Daemon；

远端的 CRIshim 主要是用来匹配其他的容器运行时工具到Kubelet。CRI shim 主要负责响应 Kubelet 发送的 CRI 请求，并将请求转化为具体的运行时命令发送给具体的运行时（如 runc、kata 等）；Stream Server 用来响应客户端与容器的交互。除此之外，CRI 还提供接入 CNI 的能力以实现 Pod 网络的共享，常用的远端 CRI 的实现有 CRI-Containerd、CRI-O 等。

可以通过 Kubelet 中的标志 --container-runtime-endpoint 和 --image-service-endpoint 来配置这两个服务的套接字。

4、Kubernetes为何抛弃Docker

不过这里同样也有一个例外，那就是 Docker，由于 Docker 当时的江湖地位很高，Kubernetes 是直接内置了 dockershim 在 Kubelet 中的，所以如果你使用的是 Docker 这种容器运行时介绍的话是不需要单独去安装配置适配器之类的，当然这个举动似乎也麻痹了 Docker 公司。

现在如果我们使用的是 Docker 的话，当我们在 Kubernetes 中创建一个 Pod 的时候，首先就是 Kubelet 通过 CRI 接口调用 dockershim，请求创建一个容器，Kubelet 可以视作一个简单的 CRI Client, 而 dockershim 就是接收请求的 Server，不过他们都是在 Kubelet 内置的。

dockershim 收到请求后, 转化成 Docker Daemon 能识别的请求, 发到 Docker Daemon 上请求创建一个容器，请求到了 Docker Daemon 后续就是 Docker 创建容器的流程了，去调用 containerd，然后创建 containerd-shim 进程，通过该进程去调用 runc 去真正创建容器。

其实我们仔细观察也不难发现使用 Docker 的话其实是调用链比较长的，真正容器相关的操作其实 containerd 就完全足够了，Docker 太过于复杂笨重了，当然 Docker 深受欢迎的很大一个原因就是提供了很多对用户操作比较友好的功能，但是对于 Kubernetes 来说压根不需要这些功能，因为都是通过接口去操作容器的，所以自然也就可以将容器运行时介绍切换到 containerd 来。

切换到 containerd 可以消除掉中间环节，操作体验也和以前一样，但是由于直接用容器运行时调度容器，所以它们对 Docker 来说是不可见的。因此，你以前用来检查这些容器的 Docker 工具就不能使用了。

你不能再使用 docker ps 或 docker inspect 命令来获取容器信息。由于不能列出容器，因此也不能获取日志、停止容器，甚至不能通过 docker exec 在容器中执行命令。

当然我们仍然可以下载镜像，或者用 docker build 命令构建镜像，但用 Docker 构建、下载的镜像，对于容器运行时介绍和 Kubernetes，均不可见。为了在 Kubernetes 中使用，需要把镜像推送到镜像仓库中去。

从上图可以看出在 containerd 1.0 中，对 CRI 的适配是通过一个单独的 CRI-Containerd 进程来完成的，这是因为最开始 containerd 还会去适配其他的系统（比如 swarm），所以没有直接实现 CRI，所以这个对接工作就交给 CRI-Containerd 这个 shim 了。

然后到了 containerd 1.1 版本后就去掉了 CRI-Containerd 这个 shim，直接把适配逻辑作为插件的方式集成到了 containerd 主进程中，现在这样的调用就更加简洁了。

与此同时 Kubernetes 社区也做了一个专门用于 Kubernetes 的 CRI 运行时 CRI-O，直接兼容 CRI 和 OCI 规范。

这个方案和 containerd 的方案显然比默认的 dockershim 简洁很多，不过由于大部分用户都比较习惯使用 Docker，所以大家还是更喜欢使用 dockershim 方案。

注意 1：据说cri-o的性能没有containerd好，并且containerd是已经在生产里经受住了考验，至少目前不推荐使用cri-o，建议使用containerd；

但是随着 CRI 方案的发展，以及其他容器运行时介绍对 CRI 的支持越来越完善，Kubernetes 社区在2020年7月份就开始着手移除 dockershim 方案了：https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2221-remove-dockershim，现在的移除计划是在 1.20 版本中将 Kubelet 中内置的 dockershim 代码分离，将内置的 dockershim 标记为维护模式，当然这个时候仍然还可以使用 dockershim，在 1.24 版本已经移出了dockershim 代码。

那么这是否就意味这 Kubernetes 不再支持 Docker 了呢？当然不是的，这只是废弃了内置的 dockershim 功能而已，Docker 和其他容器运行时介绍将一视同仁，不会单独对待内置支持。如果我们还想直接使用 Docker 这种容器运行时应该怎么办呢？可以将 dockershim 的功能单独提取出来独立维护一个 cri-dockerd 即可，就类似于 containerd 1.0 版本中提供的 CRI-Containerd，当然还有一种办法就是 Docker 官方社区将 CRI 接口内置到 Dockerd 中去实现。

但是我们也清楚 Dockerd 也是去直接调用的 Containerd，而 containerd 1.1 版本后就内置实现了 CRI，所以 Docker 也没必要再去单独实现 CRI 了。当 Kubernetes 不再内置支持开箱即用的 Docker 的以后，最好的方式当然也就是直接使用 Containerd 这种容器运行时介绍，而且该容器运行时介绍也已经经过了生产环境实践的，接下来我们就来学习下 Containerd 的使用。

cri-dockerd项目：GitHub - Mirantis/cri-dockerd

本地开发依然倾向用docker；用docker构建的镜像用contianerd也能跑啊，它们都遵守oci标准；docker太重了，问题很多，切到containerd以后性能和稳定性都有提升；

其实，现在docker已经在走标准的runc接口了。docker本身的问题：它有一个非常厚重的docker daemon,它本身的稳定性不是很好，所以现在既然它已经走了标准协议了，那么k8s所谓的不支持docker，其实，它可以通过，更轻量级，性能更好，更稳定的容器运行时介绍，例如containerd来替换docker，即docker本身已经在k8s这里已经没有存在的必要了。

为什么这么做？