When working with Kubernetes, a question that always comes up is: “How do I set up proper health checks?” The Kubernetes documentation provides some basic information about why you might want to use them and how to configure them. But, as with all things, it is a bit more complicated than that.
Types of failures
Applications are guaranteed to fail at some point during their lifetime. They might encounter an unexpected error and crash, or they might be thrown off course because a downstream service is unavailable (which shouldn’t happen, but hey: welcome to the wonderful world of distributed computing). Or maybe you just pushed a broken configuration and your service won’t start anymore.
What you can’t do is hope to avoid these situations. What you can do is make sure the application recovers as quickly as possible when the unexpected happens. Kubernetes can help in that regard with some tools that monitor your application:
- liveness probe: check if your application responds to an action (an HTTP call, a TCP connection, …). Trigger: memory leak, depleted thread pool. Resolution: restart the container.
- readiness probe: check if your application can serve traffic. Trigger: broken connection to downstream service, database down. Resolution: remove from load balancing.
- startup probe: a separate liveness probe to monitor the startup phase, for when the load behavior differs significantly between startup and normal operation. Trigger: see liveness probe. Resolution: see liveness probe.
Each probe has settings for the action, intervals, timeouts, and thresholds. But the mere existence of these probes, and the fact that they are part of the Kubernetes documentation, doesn’t mean that every application should use all of them.
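To get a feeling for how those settings interact, here is a rough Python sketch that models the timing fields of a probe and estimates how long a broken container can go unnoticed. The field names and default values mirror the Kubernetes probe fields (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold); the `ProbeTiming` class and the estimate itself are my own illustration and only approximate what the kubelet really does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Timing fields of a Kubernetes probe, with the documented defaults."""
    initial_delay_seconds: int = 0  # initialDelaySeconds
    period_seconds: int = 10        # periodSeconds
    timeout_seconds: int = 1        # timeoutSeconds
    failure_threshold: int = 3      # failureThreshold

    def seconds_until_action(self) -> int:
        """Rough upper bound between the app breaking and the kubelet acting.

        Assumes the app breaks right after a successful check and that
        every following check runs into the timeout.
        """
        return self.period_seconds * self.failure_threshold + self.timeout_seconds


defaults = ProbeTiming()
print(defaults.seconds_until_action())  # prints 31
```

With the defaults, a hanging application can keep receiving traffic (readiness) or stay unrestarted (liveness) for roughly half a minute, which is worth knowing before tuning anything.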
To restart or not to restart!?
Modern cloud orchestrators are “self-healing”. The easiest answer to many problems is simply to restart or replace something (A node is not responding! Evacuate all pods to other nodes and replace the failed one!). But a restart is not the right answer in every situation:
- If my application returns 500 INTERNAL SERVER ERROR, does a restart fix it? (maybe?)
- If the database is down, does a restart fix it? (probably not)
- If the application is too slow, is it simply overloaded or does a restart fix it? (a restart will probably make it worse)
So there are scenarios where a restart will actually make things worse, or might even trigger a service outage in the first place.
A good practice is, of course, to make sure the application itself is resilient enough to, for example, reconnect to a database or other dependencies without requiring a restart. On the other hand, if you know that an application can end up in a state that it can only recover from by restarting, don’t rely on the orchestrator noticing it via a liveness probe; consider crashing on purpose instead. The probes are meant to be a last fail-safe for unexpected and otherwise uncontrollable situations.
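As a minimal sketch of “crashing on purpose”, assuming a Python service: the `UnrecoverableError` class and the `run_step` helper are hypothetical names I made up for illustration. Recoverable errors are handed back to the caller for a retry, while an unrecoverable one ends the process with a non-zero exit code, so the container is restarted according to the pod’s restart policy without waiting for a probe to notice.

```python
import sys


class UnrecoverableError(Exception):
    """Hypothetical marker for errors the process cannot recover from."""


def run_step(step, exit_fn=sys.exit):
    """Run one unit of work and crash the process on an unrecoverable error.

    Returns True on success and False on a recoverable error, so the
    caller can decide to retry. exit_fn is injectable to make this testable.
    """
    try:
        step()
        return True
    except UnrecoverableError as err:
        print(f"fatal: {err}", file=sys.stderr)
        exit_fn(1)
        return False  # only reached if exit_fn does not really exit (tests)
    except Exception as err:
        print(f"recoverable, will retry: {err}", file=sys.stderr)
        return False
```

The important design choice is that the application decides what is fatal; the liveness probe stays out of it.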
What should I configure?
I’m not the first to worry about how to configure the probes so that they actually increase application stability. So let’s see what others propose:
- DO set up a readiness probe: “if you don’t set the readiness probe, the kubelet assumes that the app is ready to receive traffic as soon as the container starts.”
- DO crash if a fatal error occurs: “if the application reaches an unrecoverable error, you should let it crash.”
- DO configure a passive liveness probe: “if your application is processing an infinite loop, there’s no way to exit or ask for help.”
- DO NOT use the liveness probe to handle fatal errors: “the Liveness probe should be used as a recovery mechanism only in case the process is not responsive.”
- DO check dependencies that are exclusive to the pod in readiness probes.
- DO NOT check shared dependencies that will affect all replicas.
- DO use realistic and conservative timeouts for readiness probes.
- DO NOT check dependencies in liveness probes.
- DO “regularly restart containers to exercise startup dynamics and avoid unexpected behavioural changes during initialization”.
- DO “always define a Readiness Probe which checks that your application (Pod) is ready to receive traffic.”
- DO make sure the readiness endpoint runs on the same web server and resources as the production endpoints.
- DO NOT make probes depend on external dependencies that are shared between pods.
- DO NOT make probes depend on other services in the same cluster.
- DO NOT “use a Liveness Probe for your Pods unless you understand the consequences.”
- DO NOT “set the same specification for Liveness and Readiness Probe.”
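Several of these recommendations can be combined in one small sketch. The following Python code is an illustration under my own assumptions, not a reference implementation: the paths `/livez` and `/readyz`, the port, and the `checks` registry are hypothetical. The liveness check is passive and touches no dependencies, while the readiness check only looks at dependencies that belong to this pod alone, so the two probes deliberately have different specifications.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical registry of checks for pod-exclusive dependencies only
# (e.g. a local cache that must be warm). Shared dependencies are left out.
ReadinessChecks = Dict[str, Callable[[], bool]]


def liveness() -> Tuple[int, str]:
    """Passive liveness: if the process can answer at all, it is alive."""
    return 200, "alive"


def readiness(checks: ReadinessChecks) -> Tuple[int, str]:
    """Ready only when every pod-exclusive check passes."""
    failing = sorted(name for name, check in checks.items() if not check())
    if failing:
        return 503, "not ready: " + ", ".join(failing)
    return 200, "ready"


def make_handler(checks: ReadinessChecks):
    """Build a handler class that serves the assumed probe paths."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/livez":
                status, body = liveness()
            elif self.path == "/readyz":
                status, body = readiness(checks)
            else:
                status, body = 404, "not found"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode())

    return HealthHandler


def serve(checks: ReadinessChecks, port: int = 8080) -> None:
    """Blocking helper; a real service would reuse its production web server."""
    HTTPServer(("", port), make_handler(checks)).serve_forever()
```

Kubernetes treats any HTTP status from 200 up to (but not including) 400 as a success, so the 503 is what takes the pod out of load balancing.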
Let’s compile a short list to check before you configure any health check probe:
- Do you understand which problem each probe is meant to solve?
- Are you aware of the default timings and thresholds of the probes?
- Do the startup time and the worst-case latencies of the probe endpoints match the configured settings?
- Are you sure you are not handing responsibilities from your app to the runtime (e.g. signaling an error through a probe instead of simply crashing)?
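The timing questions on this list boil down to simple arithmetic. Here is a hedged Python sketch: the function `check_probe_settings` and its parameter names are hypothetical, the default values are the Kubernetes defaults, and the startup budget uses the rough “failureThreshold × periodSeconds” rule of thumb rather than the kubelet’s exact behavior. You would feed it numbers measured for your own application.

```python
from typing import List


def check_probe_settings(
    worst_case_latency_s: float,
    worst_case_startup_s: float,
    timeout_seconds: int = 1,
    period_seconds: int = 10,
    failure_threshold: int = 3,
    initial_delay_seconds: int = 0,
) -> List[str]:
    """Return warnings; an empty list means the measured numbers fit."""
    warnings = []
    # A probe endpoint slower than the timeout counts as a failed check.
    if worst_case_latency_s >= timeout_seconds:
        warnings.append(
            f"probe latency {worst_case_latency_s}s exceeds timeout {timeout_seconds}s"
        )
    # Rough time the container gets before failed checks add up to an action.
    startup_budget = initial_delay_seconds + period_seconds * failure_threshold
    if worst_case_startup_s >= startup_budget:
        warnings.append(
            f"startup {worst_case_startup_s}s exceeds budget {startup_budget}s"
        )
    return warnings


# An app that needs 45 seconds to start does not fit the defaults.
for warning in check_probe_settings(worst_case_latency_s=0.2, worst_case_startup_s=45):
    print(warning)
```

Raising `failure_threshold` to 30 in this example gives a budget of 300 seconds, which is the kind of adjustment a startup probe is meant for.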
My takeaway is that I will probably always want a conservative readiness probe and a very basic liveness probe.
However, I would also be cautious with the readiness probes in combination with dependencies; after all, it might be better to pass a 500 error to your consumers instead of having the reverse proxy return a 404 because all service instances were removed from the load balancing.
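The reasoning behind that caution fits into a toy model. This Python sketch is deliberately simplified and entirely hypothetical: it assumes a shared dependency is either up or down for all replicas at once, and that nothing else makes a replica unready.

```python
def endpoints_in_rotation(
    replicas: int,
    shared_dependency_up: bool,
    readiness_checks_dependency: bool,
) -> int:
    """Count the replicas that still receive traffic in this toy model."""
    if readiness_checks_dependency and not shared_dependency_up:
        # Every replica fails its readiness probe at the same moment.
        return 0
    return replicas


# Database down, readiness probe checks it: nothing is left to serve requests.
print(endpoints_in_rotation(3, shared_dependency_up=False, readiness_checks_dependency=True))   # prints 0
# Database down, readiness probe ignores it: all replicas stay in rotation
# and can answer with their own error response.
print(endpoints_in_rotation(3, shared_dependency_up=False, readiness_checks_dependency=False))  # prints 3
```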
Maybe a startup probe is all I need, then.