Problem statement
Without health checks for the individual components, it is hard to ensure that a deployment is highly available.
Background
In HA installations and/or pulp-operator deployments, a reverse proxy (nginx/Apache/ELB) balances requests across the pulpcore-api and pulpcore-content nodes.
For k8s installations we also have probes (readiness and liveness probes) that help manage the pods' lifecycle and traffic distribution.
As of today, we don’t have a health check configured on the reverse proxy to manage the traffic to the endpoints. As a result, if one backend node is in a failed state (but the others are healthy), the load balancer can still send requests to the failing node, and the client will receive a 5xx response. With a health check configured on the load balancer, this could be avoided because the load balancer would be aware of the node outage and would not route traffic to it.
The same situation can be avoided on k8s clusters, where we can use liveness probes to reprovision unhealthy pods and readiness probes to avoid sending traffic to pods that are not yet ready to handle requests.
For cloud environments, like AWS, we can enable Elastic Load Balancing health checks for an Auto Scaling group. This way we could not only automatically deprovision a failing instance but also automatically scale up with a healthy replacement.
Motivation
- Improve the user experience by making sure a client will not hit a node in a failed state. In a worse scenario, depending on the load balancer algorithm (for example, with sticky sessions enabled), the same client can be redirected again and again to the same failed node while other clients are redirected to healthy ones, giving that client the impression that the entire cluster is down.
- Improve resource consumption by removing failed pods from the k8s cluster and better balancing the traffic among the running pods.
- Improve resource consumption by removing failed instances from cloud clusters and better balancing the traffic among them.
- Improve resource consumption by removing failed resources from bare-metal environments (assuming they are managed by Pacemaker or another cluster resource manager) and better balancing the traffic among them.
Idea
Provide a way to run a health check on pulpcore-content and pulpcore-worker nodes. The health check should verify (a minimal sketch of these checks follows the list below):
- Communication with the database
- Communication with storage (for example: with S3, can the endpoint be reached? with a mounted volume, is the mount point accessible? etc.)
- Permission to read/write storage (this helps in situations like expired/rotated S3 credentials, changed NFS permissions, etc.)
- Available storage space
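A minimal sketch of what such a check routine could look like, assuming a Django environment (which pulpcore already provides) with `default_storage` as the configured storage backend; the helper names, the `/var/lib/pulp` path, and the 1 GiB threshold are hypothetical and not existing Pulp code:

```python
# Hypothetical health-check helpers; not part of pulpcore today.
import shutil
import uuid

from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from django.db import connection


def check_database():
    """Verify that a connection to the database can be established."""
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        return cursor.fetchone() == (1,)


def check_storage_read_write():
    """Verify that the configured storage backend accepts writes and reads.

    Covers both S3-like backends and mounted volumes, and also catches
    expired/rotated credentials or changed permissions.
    """
    probe_name = f".health-check-{uuid.uuid4()}"
    saved_name = default_storage.save(probe_name, ContentFile(b"ok"))
    try:
        with default_storage.open(saved_name) as f:
            return f.read() == b"ok"
    finally:
        default_storage.delete(saved_name)


def check_storage_space(path="/var/lib/pulp", minimum_free_bytes=1 * 1024**3):
    """Verify free space; only meaningful for filesystem-backed storage."""
    return shutil.disk_usage(path).free >= minimum_free_bytes


def healthy():
    """Aggregate all checks into a single pass/fail result."""
    checks = (check_database, check_storage_read_write, check_storage_space)
    try:
        return all(check() for check in checks)
    except Exception:
        return False
```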
For pulpcore-content, the ideal approach would be to expose the check through an HTTP endpoint (so that it can be configured as the health check target on the load balancer).
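Since pulpcore-content is aiohttp-based, the check could be wired in as an extra route; this is only a sketch, and the `/livez` path, the `pulpcore_health` module, and the `add_health_route` helper are assumptions rather than an existing Pulp API:

```python
import asyncio

from aiohttp import web

from pulpcore_health import healthy  # hypothetical module holding the checks sketched above


async def livez(request):
    # Run the blocking checks off the event loop so content serving is not stalled.
    ok = await asyncio.to_thread(healthy)  # Python 3.9+
    return web.json_response({"healthy": ok}, status=200 if ok else 503)


def add_health_route(app: web.Application):
    """Register the health endpoint on an existing aiohttp application."""
    app.router.add_get("/livez", livez)
```

The load balancer (or a k8s readiness probe) would then point at this endpoint and take any non-200 response as "remove the node from rotation".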
For pulpcore-worker, since its traffic is not proxied through a load balancer, we can provide a script that kubelet can invoke (for example, as an exec liveness probe).
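A small command-line wrapper around the same checks would be enough for kubelet's exec probe (or for a cluster resource manager on bare metal); the script and the `pulpcore_health` module are hypothetical:

```python
#!/usr/bin/env python3
# Hypothetical pulpcore-worker health-check script; exits 0 when healthy so it
# can be used directly as a kubelet exec probe. Assumes DJANGO_SETTINGS_MODULE
# is already set in the worker's environment.
import sys

import django


def main():
    django.setup()  # bootstrap the ORM and the configured storage backend
    from pulpcore_health import healthy  # hypothetical module holding the checks sketched above

    sys.exit(0 if healthy() else 1)


if __name__ == "__main__":
    main()
```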
Value
The health check will not provide a new feature for Pulp users, but it will improve the product's reliability and HA.