
What is the problem/motivation?

As of version 1.27, Kubernetes does not yet have a mechanism to coordinate the lifecycle of multiple containers in a single pod, though the request has existed for some time now and the topic has recently been gaining traction, with other projects benefiting from it. Whenever a pod is started, all (non-init) containers are initialized and started at the same time, with one noteworthy exception mentioned below. Likewise, when a pod is terminated due to a Kubernetes (eviction) API call, all containers receive the stop signal (SIGTERM by default) at the same time.

This usually becomes a problem when one or multiple containers act as a sidecar container, supporting a main container. One example of a sidecar container is the GCP CloudSQL auth proxy, providing a secure database connection to the main container. Other examples are service meshes like Istio or Linkerd providing networking capabilities to all containers in the pod.

In these cases, two things should happen:

  1. When the pod is started, the sidecar container should start first and become ready before the main container is started. Otherwise, the main container may try to connect to the database through the sidecar container before it is ready, or - in the context of service meshes - try to reach any external service/IP through the service mesh proxy before it is ready, generating errors (and producing alerting error logs or metrics) and being restarted unnecessarily by Kubernetes.

  2. When the pod is being terminated, i.e. by an external Kubernetes API event/call, the main container should terminate first, followed by the sidecar container. This ensures that during graceful termination the main container can still use the services provided by the sidecar container, e.g. connections to the database or networking capabilities provided by service mesh proxies, if necessary.

Yet another requirement exists for zero-downtime deployments due to the asynchronous nature of Kubernetes and how different pieces/parts process the termination of a pod:

  1. A pod must shut down gracefully, allowing any current and still-incoming requests to be processed successfully. This usually means that a pod's container cannot shut down immediately after receiving the stop signal (SIGTERM) but must continue to process requests normally for a few seconds, until all Kubernetes components - such as the kube-proxy updating a node's iptables rules or an Nginx Ingress Controller updating its upstream/backend servers - have taken the pod out of service as a potential backend in the Kubernetes service's Endpoints resource. So, container termination must be delayed for some arbitrary amount of time. See this fantastic post for more details: https://learnk8s.io/graceful-shutdown

In addition to such long-running deployment/statefulset scenarios, another requirement is added in the context of Kubernetes CronJobs/Jobs:

  1. In a Kubernetes Job pod, containers are not restarted once they exit, either normally or abnormally. A job pod's container is expected to end by itself after it is done with its task. When using a sidecar container alongside a main container, the sidecar usually doesn't know when the main container's task is done and when it can exit. So, even after the main container has exited, the sidecar container will continue running, keeping the resource allocations of the whole pod and potentially preventing the CronJob controller from spawning the next job in time (if no concurrency is allowed).

How do people currently solve these problems?

Starting a main container only after the sidecar became ready

Istio and Linkerd solve the problem of starting the main container only after the sidecar container - injected via a Mutating Admission Webhook - became ready by making use of an implementation detail of the Kubelet. When the Kubelet running on a Kubernetes node recognizes a pod scheduled on that node, it synchronously processes the list of containers in the pod's .spec.containers list and makes calls to the Container Runtime to start them. This list is processed in a loop, and the next container can only be asked to start via the Container Runtime after the previous container has been processed. In particular, the step of executing an exec .lifecycle.postStart hook handler is implemented in a way that blocks the next loop iteration for as long as that exec hook handler is running. While some people consider this a bug or an unfortunate behaviour and have requested that it be changed, others have become dependent upon it, making it rather hard/impossible to change afterwards.

This sounds like another instance of "Hyrum's Law", where:

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

Both Istio and Linkerd inject their sidecar containers at the very beginning of the .spec.containers list, with an additional .lifecycle.postStart exec command which probes the sidecar container's process and waits/blocks until it becomes ready. As mentioned above, any container coming after that in the list of pod containers will start only after the postStart hook of the previous container exits. In Istio, this behaviour is not enabled by default, but it can be configured.
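
The following is a minimal sketch of this pattern for a generic, manually-defined sidecar (not injected by a webhook). The container names, images and the readiness URL are placeholders, and it assumes the sidecar image ships a shell and wget:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-first-example
spec:
  containers:
    # The sidecar is listed FIRST so the Kubelet processes it before the main
    # container. The blocking postStart hook exploits the Kubelet behaviour
    # described above: the next container is only started once this hook
    # handler has returned.
    - name: proxy-sidecar              # hypothetical sidecar container
      image: example.com/proxy:1.0     # placeholder image
      lifecycle:
        postStart:
          exec:
            command:
              - /bin/sh
              - -c
              # Poll an assumed local readiness endpoint until it answers.
              - "until wget -q -O /dev/null http://127.0.0.1:8080/ready; do sleep 1; done"
    # This container only starts after the hook above has exited.
    - name: main
      image: example.com/app:1.0       # placeholder image
```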

For other containers, like the GCP CloudSQL auth proxy, which are not injected by a Mutating Admission Webhook, container-specific solutions are used, which basically all revolve around using the same postStart hook blocking the start of the main container until the proxy is ready.

The CloudSQL Auth proxy, for example, provides a built-in HTTP web server serving a /readiness healthcheck that can be used in a postStart hook or in combination with a Kubernetes container startupProbe or readinessProbe, ensuring that the pod is not taken into service by Kubernetes even if the main container started before the sidecar became ready. Since version 2.8.0, the CloudSQL proxy can also wait for itself to become ready via its own CLI's special wait subcommand.
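
For illustration, here is a sketch of a Cloud SQL Auth proxy sidecar using the /readiness endpoint and the wait subcommand mentioned above. The image tag, binary path, flag names and port are assumptions and should be verified against the proxy documentation for the version in use; the instance connection name is a placeholder:

```yaml
# Fragment of a pod's .spec.containers
containers:
  - name: cloud-sql-proxy
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0   # assumed image/tag
    args:
      - "--health-check"               # assumed flag enabling the built-in HTTP health endpoints
      - "--http-address=0.0.0.0"       # assumed flag; the kubelet probes via the pod IP
      - "<INSTANCE_CONNECTION_NAME>"   # placeholder
    # Option 1: keep the whole pod out of service until the proxy is ready.
    readinessProbe:
      httpGet:
        path: /readiness
        port: 9090                     # assumed health-check port
    # Option 2: block the start of the next container via the postStart trick,
    # here using the proxy's own `wait` subcommand (available since 2.8.0).
    lifecycle:
      postStart:
        exec:
          command: ["/cloud-sql-proxy", "wait"]   # assumed binary path inside the image
  - name: main
    image: example.com/app:1.0         # placeholder image
```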

Stopping the sidecar only after the main container exited

Since there is no way to hook into the Kubelet's termination sequence - e.g. via preStop lifecycle hooks - to delay the shutdown of one container until another container has shut down, every image/container must provide its own way of delaying termination after receiving the stop signal from Kubernetes.

In the case of Istio, there is the experimental EXIT_ON_ZERO_ACTIVE_CONNECTIONS flag to block termination of the sidecar container for as long as the proxy still has open connections. This is not enabled by default, because making assumptions about the main container being "done" just because it currently has no open connections is debatable. Other ideas for Istio involve wrapping the main container's process in another main (PID 1) process which shuts down the Istio proxy once the container's actual, wrapped main process has exited. This is basically what Linkerd has provided since the beginning of 2019 with linkerd-await.
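
A sketch of that wrapper approach, using linkerd-await as named above; the binary path inside the image, the exact flags and the application command are assumptions and should be checked against the linkerd-await documentation:

```yaml
# Fragment of the main container: linkerd-await runs as PID 1, waits for the
# proxy to become ready, runs the application, and (with --shutdown) asks the
# proxy to shut down once the application has exited.
containers:
  - name: main
    image: example.com/app:1.0                                 # placeholder image bundling linkerd-await
    command: ["/linkerd-await", "--shutdown", "--", "/app"]    # paths and flags are assumptions
```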

In the same way, the GCP CloudSQL auth proxy also delays termination while open connections to the database still exist, providing the --max-sigterm-delay flag to specify the maximum delay after receiving the stop signal. But even with this, coordinating sidecar container shutdown is tricky when there are multiple such sidecar containers, each responding differently to stop signals.
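
As a small illustration, the flag could be set on the proxy sidecar like this (the duration is arbitrary and the instance connection name is a placeholder):

```yaml
# Fragment of the Cloud SQL Auth proxy sidecar's args
args:
  - "--max-sigterm-delay=30s"        # keep serving open connections for up to 30s after SIGTERM
  - "<INSTANCE_CONNECTION_NAME>"     # placeholder
```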

Graceful shutdown in deployments/statefulsets

Because the processing of a Kubernetes pod termination event consists of many different Kubernetes components reacting to that event in an asynchronous and fault-tolerant way, there is no single "now the shutdown is completely done and the pod will definitely not get any more requests" event that can be used to trigger the final termination of the pod's containers. Instead, the pod's containers must be shut down after some arbitrary amount of time, to increase the chance that all Kubernetes components have taken the pod out of service by that time.

This is usually done via a preStop lifecycle hook that simply waits for some duration before exiting, thus delaying the moment at which the Kubelet sends the stop/termination signal (SIGTERM by default) to the container's main process.
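
A minimal sketch of such a hook, assuming the image contains a sleep binary; the 15-second delay is arbitrary and must stay below the pod's terminationGracePeriodSeconds (30s by default), which also has to leave room for the application's own shutdown:

```yaml
# Fragment of a container spec
lifecycle:
  preStop:
    exec:
      # Delay SIGTERM so that kube-proxy, ingress controllers, etc. have time
      # to take the pod out of rotation before the process starts shutting down.
      command: ["/bin/sleep", "15"]
```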

Depending on the implementation of the main process, this waiting can also be implemented within the process itself, if it can trap the stop/termination signal. This avoids one disadvantage of the preStop lifecycle hook: the container's file system must provide some executable that can be used as a hook handler, which becomes a problem when using minimal (or distroless) images where - apart from the main process's executable - there simply isn't any other executable available (such as /bin/sleep).

Stopping the sidecar after the main container exited in CronJobs/Jobs

In the case of CronJobs/Jobs, the main container is expected to exit after it is done with its task. The sidecar container must then also be stopped to avoid a dangling job pod that keeps the resource allocations of the whole pod and potentially prevents the CronJob controller from spawning the next job in time, if no concurrency is allowed and the job/pod does not have a timeout set via activeDeadlineSeconds.

Solving this is trickier, since it always involves some kind of communication between the main container and the sidecar container. One solution is to use a shared volume between the main and the sidecar container: the main container writes a file to the volume when it is done with its task, and the sidecar container watches for that file and exits when it is found.
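
A sketch of this shared-volume approach, assuming both images ship a shell; all names, images and commands are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-sidecar
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: lifecycle                 # shared scratch volume for the sentinel file
          emptyDir: {}
      containers:
        - name: main
          image: example.com/task:1.0     # placeholder image
          command:
            - /bin/sh
            - -c
            # Run the task, remember its exit code, then drop the sentinel file.
            - "/run-task; code=$?; touch /lifecycle/done; exit $code"
          volumeMounts:
            - name: lifecycle
              mountPath: /lifecycle
        - name: sidecar
          image: example.com/proxy:1.0    # placeholder image
          command:
            - /bin/sh
            - -c
            # Run the proxy in the background, poll for the sentinel file and
            # exit 0 once the main container signals completion.
            - "/run-proxy & pid=$!; until [ -f /lifecycle/done ]; do sleep 1; done; kill $pid; exit 0"
          volumeMounts:
            - name: lifecycle
              mountPath: /lifecycle
```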

Another solution, which requires a shared process namespace in the pod, is to let the main container (as part of its run command) kill the sidecar container's process (using the process id or process name) when it is done with its task.
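
A sketch of this variant; it assumes the main image ships a shell and pkill, that the proxy's process name is known, and that the main container runs as a user allowed to signal the proxy's process:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-pod-kill-sidecar
spec:
  shareProcessNamespace: true        # all containers see each other's processes
  restartPolicy: Never
  containers:
    - name: main
      image: example.com/task:1.0    # placeholder image with /bin/sh and pkill
      command:
        - /bin/sh
        - -c
        # Run the task, then signal the sidecar's process by name.
        - "/run-task; code=$?; pkill -TERM run-proxy; exit $code"
    - name: sidecar
      image: example.com/proxy:1.0   # placeholder image
      command: ["/run-proxy"]
```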

The problem with killing the sidecar container, however, is the reported exit code of that sidecar container once it receives a SIGTERM signal. Depending on the implementation of the sidecar container's process, it is likely to be 143 (128 + 15), with 15 being the signal number of SIGTERM. This is a problem for Kubernetes, since it expects an exit code of 0 from all containers for a successfully completed pod and thus a successful job execution. So, the main container must somehow communicate to the sidecar container that it is done with its task and that the sidecar container should exit with a 0 exit code for the whole job to be reported as successful.

Istio solves this by providing a /quitquitquit HTTP endpoint on the sidecar container's HTTP server, which, when called, results in a shutdown with a 0 exit code. The GCP CloudSQL auth proxy supports the same /quitquitquit endpoint in its own HTTP server for the same purpose. Here, the main container should call this HTTP endpoint (as part of its container run command) after its actual task has finished, in order to initiate shutdown of the proxy sidecar.
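
A sketch of how a job's main container could do this; the port is a placeholder (each sidecar documents its own address and may require the endpoint to be enabled), and the image is assumed to contain a shell and curl:

```yaml
# Fragment of the Job's main container
containers:
  - name: main
    image: example.com/task:1.0      # placeholder image with /bin/sh and curl
    command:
      - /bin/sh
      - -c
      - |
        # Run the task, remember its exit code, then ask the sidecar to shut
        # down cleanly via its /quitquitquit endpoint.
        /run-task
        code=$?
        curl -sf -X POST http://127.0.0.1:15020/quitquitquit || true
        exit $code
```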

Yet other solutions simply time-box the job, either via a container run command like timeout 60 <mainprocess> or via the pod's activeDeadlineSeconds, letting Kubernetes terminate the pod and all its containers if it runs longer than the allocated time. This is a valid option for jobs that are scheduled regularly anyway and are not expected to run very long.
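
For completeness, a sketch of such a time-boxed job; the durations are arbitrary, the images and commands are placeholders, and the timeout binary is assumed to exist in the main image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: time-boxed-job
spec:
  template:
    spec:
      # Pod-level deadline: Kubernetes terminates all containers, including a
      # lingering sidecar, once the pod has run for 10 minutes.
      activeDeadlineSeconds: 600
      restartPolicy: Never
      containers:
        - name: main
          image: example.com/task:1.0                         # placeholder image
          command: ["/usr/bin/timeout", "60", "/run-task"]    # assumed timeout binary path
        - name: sidecar
          image: example.com/proxy:1.0                        # placeholder image
```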


How to solve it in a generic way?

After having learnt about all the problems that exist in a multi-container pod (be it for a long-running deployment/statefulset or a short-lived CronJob/Job), let's explore how we can solve them in a generic way:

  • without relying on Kubelet implementation details which (albeit being relied on pretty heavily by now) haven't been officially documented/specified
  • without having to enable a shared process namespace and kill the sidecar container's process from the main container
  • without using arbitrary activeDeadlineSeconds timeouts on a job pod to eventually terminate the sidecar container