Problem statement
Without health checks for the individual components, it is hard to ensure that a deployment is highly available.
Background
In HA installations and/or pulp-operator deployments, a reverse proxy (nginx/Apache/ELB) balances requests across the pulpcore-api and pulpcore-content nodes.
For k8s installations we also have probes (readiness and liveness probes) that help manage the pods' lifecycle and traffic distribution.
As of today, we don’t have a health check configured on the reverse proxy to manage the traffic to the endpoints. As a result, if one backend node is in a failed state (but the others are healthy), the load balancer can still send requests to the failing node, and the client will receive a 5xx response. With a health check configured on the load balancer, this could be avoided because the load balancer would be aware of the node outage and would not route traffic to it.
The same situation can be avoided on k8s clusters, where we can use liveness probes to reprovision unhealthy pods and readiness probes to avoid sending traffic to pods that are not yet ready to handle requests.
For cloud environments, like AWS, we can enable Elastic Load Balancing health checks for an Auto Scaling group. This way we could not only automatically deprovision a failing instance but also automatically scale up with a healthy replacement.
Motivation
- Improve the user experience by making sure a client will not hit a node in a failed state. In a worse scenario, depending on the load balancer algorithm (for example, with sticky sessions enabled), the same client can be redirected again and again to the same failed node while other clients are redirected to healthy ones, giving that client the impression that the entire cluster is down.
- Improve resource consumption by removing failed pods from the k8s cluster and better balancing the traffic among the running pods.
- Improve resource consumption by removing failed instances from cloud clusters and better balancing the traffic among them.
- Improve resource consumption by removing failed resources from bare-metal environments (assuming they are managed by Pacemaker or another cluster resource manager) and better balancing the traffic among them.
Idea
Provide a way to run a health check on pulpcore-content and pulpcore-worker nodes. The health check should verify (a minimal sketch of these checks follows the list below):
- Communication with the database
- Communication with storage (for example: with S3, can the endpoint be reached? with a mounted volume, is the mount point accessible? etc.)
- Permission to read/write storage (this helps in situations like expired/rotated S3 credentials, changed NFS permissions, etc.)
- Available storage space
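A minimal sketch of what such a check routine could look like, assuming a Django environment (which pulpcore already provides) with `default_storage` as the configured storage backend; the helper names, the `/var/lib/pulp` path, and the 1 GiB threshold are hypothetical and not existing Pulp code:

```python
# Hypothetical health-check helpers; not part of pulpcore today.
import shutil
import uuid

from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from django.db import connection


def check_database():
    """Verify that a connection to the database can be established."""
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        return cursor.fetchone() == (1,)


def check_storage_read_write():
    """Verify that the configured storage backend accepts writes and reads.

    Covers both S3-like backends and mounted volumes, and also catches
    expired/rotated credentials or changed permissions.
    """
    probe_name = f".health-check-{uuid.uuid4()}"
    saved_name = default_storage.save(probe_name, ContentFile(b"ok"))
    try:
        with default_storage.open(saved_name) as f:
            return f.read() == b"ok"
    finally:
        default_storage.delete(saved_name)


def check_storage_space(path="/var/lib/pulp", minimum_free_bytes=1 * 1024**3):
    """Verify free space; only meaningful for filesystem-backed storage."""
    return shutil.disk_usage(path).free >= minimum_free_bytes


def healthy():
    """Aggregate all checks into a single pass/fail result."""
    checks = (check_database, check_storage_read_write, check_storage_space)
    try:
        return all(check() for check in checks)
    except Exception:
        return False
```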
For pulpcore-content, the ideal approach would be to expose the check through an HTTP endpoint (so that it can be configured as the health check target on the load balancer).
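Since pulpcore-content is aiohttp-based, the check could be wired in as an extra route; this is only a sketch, and the `/livez` path, the `pulpcore_health` module, and the `add_health_route` helper are assumptions rather than an existing Pulp API:

```python
import asyncio

from aiohttp import web

from pulpcore_health import healthy  # hypothetical module holding the checks sketched above


async def livez(request):
    # Run the blocking checks off the event loop so content serving is not stalled.
    ok = await asyncio.to_thread(healthy)  # Python 3.9+
    return web.json_response({"healthy": ok}, status=200 if ok else 503)


def add_health_route(app: web.Application):
    """Register the health endpoint on an existing aiohttp application."""
    app.router.add_get("/livez", livez)
```

The load balancer (or a k8s readiness probe) would then point at this endpoint and take any non-200 response as "remove the node from rotation".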
For pulpcore-worker, since its traffic is not proxied through a load balancer, we can provide a script that kubelet can invoke (for example, as an exec liveness probe).
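A small command-line wrapper around the same checks would be enough for kubelet's exec probe (or for a cluster resource manager on bare metal); the script and the `pulpcore_health` module are hypothetical:

```python
#!/usr/bin/env python3
# Hypothetical pulpcore-worker health-check script; exits 0 when healthy so it
# can be used directly as a kubelet exec probe. Assumes DJANGO_SETTINGS_MODULE
# is already set in the worker's environment.
import sys

import django


def main():
    django.setup()  # bootstrap the ORM and the configured storage backend
    from pulpcore_health import healthy  # hypothetical module holding the checks sketched above

    sys.exit(0 if healthy() else 1)


if __name__ == "__main__":
    main()
```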
Value
The health check will not provide a new feature for Pulp users, but it will improve the product's reliability and HA.