Support zero-downtime updates

teg · September 29, 2022, 4:38pm

I’d like to run Pulp in OpenShift and be able to update the containers without losing availability. Is that something you currently support, and if not, is it something you would consider?

awcrosby · October 3, 2022, 2:00pm

+1 to this for our usage of Pulp with Ansible Automation Hub hosted on OpenShift. We would like to keep the service available during upgrades.

This would be useful even where you can immediately issue commands to bring services down and up. In our setup we use GitOps to manage spinning services down and up, which brings many benefits, but it also means extra wait time while CI runs on the PRs that bring down services, start an upgrade, and restart services, resulting in extra downtime.

hyagi · October 3, 2022, 3:06pm

Hi,

The preferred way to install Pulp on OpenShift is the operator, which you can find in the OCP catalog or on OperatorHub.

Regarding downtime, this is a tough question because the answer depends on architectural choices: how you deploy and configure your operator, and which update path you take.
Considering that by “update the containers” you mean modifying the container image that is deployed:

By default, the Pulp operator provisions the k8s Deployments with the RollingUpdate strategy (which supports running multiple versions of an application at the same time) and will ensure that a minimum number of replicas stays available.
To avoid issues with the RollingUpdate, you would need an RWX volume (if the components are configured with RWO, the reconcile loop can get stuck).
It is possible to define the number of replicas for each component (api/content/worker) through the Pulp CR.
The current operator implementation does not provision a cluster of PostgreSQL containers. If you have a running PostgreSQL cluster, you can configure the Pulp operator to use it.
Depending on the upgrade path, there is a small chance that Pulp will need to run Django migration tasks that would cause downtime.

bmbouter · October 3, 2022, 4:02pm

One significant issue preventing this is in the Pulp application itself, which currently requires upgrades to work like this:

Stop all pulp services
Install the upgraded code
Apply migrations
Start the upgraded services

What we want is:

Install the upgraded code
Apply migrations
Restart the services when you can, e.g. a rolling restart would be good

The only thing preventing that (that I know of) is the migrations, which can apply breaking changes to the DB that make it incompatible with the older Pulp code still running in the existing services. For example, consider a db column rename: as soon as that migration is applied, any service that has not been stopped, or not been restarted with the upgraded code, would start to return 500 errors because it still expects the old column name.

The solution, I believe, is a project-wide policy change for migrations whereby any “breaking db change” is split into two parts: an additive change and a destructive change. There is a lot more info about this here: django-pg-zero-downtime-migrations · PyPI
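To make the additive/destructive split concrete, here is a minimal sketch of a column rename done in two parts. It is only an illustration under assumptions: SQLite from the Python standard library stands in for PostgreSQL, and the `repo` table with its `name` and `title` columns is hypothetical, not part of Pulp's schema. It only covers reads; writes made by old code during the transition would need extra handling (for example dual writes), which is not shown.

```python
import sqlite3


def make_db():
    # In-memory database with a hypothetical table that still uses the old column name.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE repo (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO repo (name) VALUES ('fedora'), ('centos')")
    return conn


def old_code_read(conn):
    # Query issued by services still running the pre-upgrade code.
    return [r[0] for r in conn.execute("SELECT name FROM repo ORDER BY id")]


def new_code_read(conn):
    # Query issued by services already restarted with the upgraded code.
    return [r[0] for r in conn.execute("SELECT title FROM repo ORDER BY id")]


def additive_migration(conn):
    # Part 1: add the new column and backfill it. Nothing the old code reads is removed.
    conn.execute("ALTER TABLE repo ADD COLUMN title TEXT")
    conn.execute("UPDATE repo SET title = name")


def destructive_migration(conn):
    # Part 2: drop the old column by rebuilding the table (works on any SQLite version).
    conn.execute("CREATE TABLE repo_new (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO repo_new (id, title) SELECT id, title FROM repo")
    conn.execute("DROP TABLE repo")
    conn.execute("ALTER TABLE repo_new RENAME TO repo")


conn = make_db()
additive_migration(conn)
# During the rolling restart, both generations of code can read.
print(old_code_read(conn), new_code_read(conn))

destructive_migration(conn)
# Only safe once no old service remains: the old query now fails.
try:
    old_code_read(conn)
except sqlite3.OperationalError as exc:
    print("old code fails:", exc)
```

The point of the split is the window between the two functions: after the additive part, old and new services can run side by side, and the destructive part is deferred until every service has been restarted with the upgraded code.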

dkliban · October 4, 2022, 1:15pm

Here is a document on what needs to happen with regard to Django migrations: Django migrations without downtime · GitHub

x9c4 · October 6, 2022, 8:09am

This document covers even more ground:

https://pankrat.github.io/2015/django-migrations-without-downtimes/

x9c4 · October 6, 2022, 8:41am

My conclusions after reading some of these articles:

Zero downtime does not come as a free gift. We cannot just add a library that takes care of it; we need to be very careful to write migrations in a way that both new and old code can work with. We may need a lot of database-version-aware code.
In any case, it is worth shutting down workers while migrating.
Maybe we can introduce a maintenance mode where no costly tasks can be dispatched, but at least the content app stays online to reduce the code that needs awareness.
Our plugin infrastructure makes things additionally complex.
We cannot upgrade from anywhere without downtime.
We need to select windows for zero downtime upgrades. (single commits; whole releases?)
We probably need to align destructive database migrations with the deprecation policy.
Not allowing migrations in z-stream releases was a very wise decision.
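As a rough illustration of the maintenance-mode idea from the list above, the sketch below refuses to dispatch tasks while a flag is set but keeps a read-only content path available. Everything in it (`TaskDispatcher`, `MaintenanceModeError`, `serve_content`) is a hypothetical name made up for this example; it is not Pulp's tasking or content app API.

```python
class MaintenanceModeError(RuntimeError):
    """Raised when a task is refused because maintenance mode is on."""


class TaskDispatcher:
    # Hypothetical stand-in for a tasking entry point; not Pulp's actual API.
    def __init__(self):
        self.maintenance = False
        self.queue = []

    def dispatch(self, task_name):
        # Refuse new work while migrations are in progress.
        if self.maintenance:
            raise MaintenanceModeError(f"refusing {task_name!r} during maintenance")
        self.queue.append(task_name)
        return len(self.queue)


def serve_content(path):
    # Read-only content serving is not gated by maintenance mode in this sketch.
    return f"200 OK {path}"


dispatcher = TaskDispatcher()
dispatcher.dispatch("sync")           # accepted in normal operation
dispatcher.maintenance = True         # switch on before applying migrations
print(serve_content("/repo/file"))    # content stays reachable
try:
    dispatcher.dispatch("publish")    # refused while maintenance mode is on
except MaintenanceModeError as exc:
    print(exc)
```

Gating only the dispatch path keeps the amount of code that must tolerate a mid-migration schema small, which is the motivation given in the list.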

bmbouter · October 10, 2022, 2:15pm

Thank you for the investigation and info. Can we find some time to discuss the findings? I’d like to join; can others who would like to join also post here saying so?

It would be helpful if more details can be written out about a lot of the bullet points. The main question I have for most of them is “why not?”. For example, “We cannot upgrade from anywhere without downtime.” I’m wondering “why not?”. I think I know the reason, but having a written version would help us get there as a group quicker.

ggainey · October 10, 2022, 6:52pm

I’d like to be in on such a meeting, if/when/as it happens.

dkliban · October 11, 2022, 1:45pm

I would like to participate in this working group.

awcrosby · October 11, 2022, 8:38pm

I would like to participate in this discussion.

ipanova · October 12, 2022, 3:15pm

+me

PurplePaul · October 12, 2022, 5:59pm

+1 to this requirement. We (very large enterprise @bmbouter is familiar with) are also interested in Pulp 3 supporting this.

bmbouter · October 18, 2022, 8:20pm

I set up the first working group call for this Thursday. We’ll try to take excellent notes for those who cannot attend. It’s designed to be a technical call to focus on the problem statements, maybe some possible solutions, and determine some experiments we can run (hopefully) as next steps. We’ll post the notes on this thread. If anyone else wants to join, please post here also. Thanks!

bmbouter · October 20, 2022, 7:54pm

Our working group met today and focused on characterizing the problem (not solving it). See the notes here.

We’ll meet again next week to continue the discussion. If you are interested in joining the call please post here, otherwise we’ll share the notes next week also.

bmbouter · October 27, 2022, 7:51pm

The group met today, and added some more content to the notes document.

One thing that became clear is that it would be helpful to have two things implemented to make zero downtime migrations easier:

Our next meeting is on Nov 14th to let us focus on Pulpcon which is the week before.

bmbouter · November 14, 2022, 5:29pm

The group met today, and we focused on improving the epic, which got revisions to the work and a few more subtasks planned.

Our next meeting is on Dec 1, and we’ll hopefully be focusing on an agreement for a limited set of plugins and maybe pulpcore to begin following zero-downtime policies (that we’ll document) to give us practical experience for this problem we’ve now analyzed theoretically.

bmbouter · December 5, 2022, 9:51pm

The group met last Thursday and determined that the most reasonable next steps are:

Create plugin writer documentation that documents the policies plugin writers need to adhere to for a plugin to be zero-downtime compatible. Also document the tools they have in their toolbox when doing so.
Have pulpcore and pulp_file run an experiment whereby only zero-downtime migrations are written going forward. This should be discussed at the Dec 6th pulpcore meeting.

bmbouter · December 6, 2022, 2:59pm

At the pulpcore meeting today, we determined the following:

I’ll open a docs PR for these two issues: https://github.com/pulp/pulpcore/issues/3368 and https://github.com/pulp/pulpcore/issues/3443
We’ll review ^ changes through the normal review process
Starting as soon as ^ are merged, pulpcore and pulp_file will run an experiment where breaking changes are no longer allowed. However, we are not guaranteeing this to our users at this time.

bmbouter · January 17, 2023, 3:38pm

The zero-downtime documentation has been merged. All plugins are encouraged to start trying to adhere to those requirements as we are attempting to learn from situations when challenging aspects come up. See the docs here: https://github.com/pulp/pulpcore/blob/main/docs/plugin_dev/plugin-writer/concepts/index.rst#zero-downtime-upgrades