Repo version cleanup problem

Inspired by the thread about running Pulp in Production, I figured I’d share a problem that we’ve run into with cleaning up orphans.

Say you’re running Pulp in production for quite some time and you have a repo that’s collected hundreds of versions. Eventually you want to clean up these versions and their content. Pulp provides an automated way to do this so you don’t have to do it for each individual repo and that’s via “retain_repo_versions”.

The problem is that the retain_repo_versions feature is unsafe. Let me give an example: let’s say you have a repo whose id is set on a distribution, and its current version, 98, is published. You have a bunch of versions (and their content) to clean up, so you set retain_repo_versions to 3. However, if you then add 4 packages to your repo one by one, the current version becomes 102 (unpublished) and version 98 (published) gets deleted. In turn, your repo is no longer distributed.
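To make the example concrete, here is a toy simulation in plain Python. It does not call Pulp at all; `versions_to_delete` is a hypothetical stand-in that just assumes "keep the N newest versions", which is how the behavior described above looks from the outside.

```python
def versions_to_delete(existing, retain):
    """Toy model (an assumption, not Pulp code): keep only the newest `retain` versions."""
    ordered = sorted(existing)
    return ordered[:-retain] if retain > 0 else ordered


# Version 98 is the one the distribution is serving.
distributed = 98
versions = [96, 97, 98]  # history already trimmed to 3 versions

# Four one-package uploads, each creating a new repo version.
for new_version in (99, 100, 101, 102):
    versions.append(new_version)
    for doomed in versions_to_delete(versions, retain=3):
        versions.remove(doomed)

print(versions)                 # [100, 101, 102]
print(distributed in versions)  # False
```

Nothing in this model knows that 98 is special, which is exactly the problem: the version being served ages out like any other.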

We could set retain_repo_versions to a very high number (e.g. 100) but users often do crazy things (like add 101 packages one at a time with a script or something). So for the time being, we’ve disabled retain_repo_versions and content is beginning to accumulate in our system.

I was thinking about writing a script that would essentially perform the same behavior as retain_repo_versions does (but in a safe way). If I do, I’ll share it back here.

@davidd It feels like what we’re discovering here is “all repo-versions are equal, but some are More Equal than others”. Versions with associated Distributions > versions with associated Publications > ‘plain’ versions. Or something like that.

The heuristics feel hairy - maybe scripting from “outside” is the best place to put the “brains”. Let us know how it goes.

A thought - what if “retain-repo-versions” came with a “--lock-distributions” option? So only versions that were not being referenced by a Distribution would be available to be reaped?

Or even - what if the logic was just “Pulp won’t auto-cleanup versions that are distributed, if you want those gone you’ll need to do it yourself”?
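A minimal sketch of what that rule could look like, again in plain Python with no Pulp calls. The function name and the `protected` parameter are made up for illustration; in practice `protected` would have to be filled with whatever versions are currently referenced by a Distribution (directly or through a Publication).

```python
def safe_versions_to_delete(existing, retain, protected):
    """Return versions a cleanup may remove, never touching `protected` ones.

    Hypothetical helper: `protected` is the set of version numbers that
    are currently distributed and must survive the cleanup.
    """
    ordered = sorted(existing)
    candidates = ordered[:-retain] if retain > 0 else ordered
    return [v for v in candidates if v not in protected]


existing = list(range(95, 103))  # versions 95..102
print(safe_versions_to_delete(existing, retain=3, protected={98}))
# [95, 96, 97, 99]
```

One design choice in this sketch: protected versions do not count toward `retain`, so a repo can end up with more than `retain` versions. Whether that is the desired behavior is an open question.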

@davidd Thoughts?

(Note that I’m not even sure how possible this is, I haven’t looked at the retain-versions codepaths yet. This is just Thoughts In My Brain)

I think either of those options makes sense. Repo cleanup happens in the background, so the lock distributions (or maybe “lock distributed repo versions”?) behavior would need to be a setting.

One other thing I want to add is that, for most users, retain_repo_versions is probably unusable in its current state: very few users will want to use it if it deletes distributed repo versions.

One can make the case that “most” (or at the very least “very many”) users want a single distribution per repo, that always serves the most-recent version/publication. They turn on autopublish, create a Distribution for the repository, and then sync every day and magic happens. They retain, say, five versions, so that if disaster strikes on a given day, they can point their distribution to a previous repo-version while they’re working to fix whatever they’re sync’ing.

That doesn’t detract at all from the need or discussion here - just that there is a very common workflow that the current retain_repo_versions implementation serves perfectly.

You could also argue that deleting a repository version in a way that auto-deletes a publication or unlinks a distribution, without reserving the respective resources, is a bug.
Anyway, I can see a clear case for “do not delete something that is currently distributed”. With the indirection through publications, it is much harder to see what is actually desired.

I thought of another issue that kind of throws a wrench in things: we’ve talked about users being able to snapshot repos and then roll back to these snapshots. If we enable retain_repo_versions, then this isn’t possible, because the snapshot versions would eventually be deleted. Maybe Pulp could offer a way to protect certain repo versions? Have other users requested such a feature?
