When to allow removal of remotes used for on-demand syncing?

Problem Description

Right now, users are allowed to delete a remote anytime. The deletion can disrupt content artifacts that contain the reference to remote artifacts associated with the remote.

When a remote is deleted (and there is no other backup left - another remote) the publication machinery in pulp_file fails because it cannot fetch the URL for any of the synced content units.

Proposed Solution

To prevent such from happening, we propose adding a check hooked to the PRE_DELETE signal to disallow the removal of a remote when no other remote artifacts are left. This forces users to delete all synced repositories and synced content before deleting the actual remote. As a result, content artifacts will not end up in an inconsistent state.

Disadvantages
  1. The solution adds more burden on users when deleting regular remotes.
  2. The solution may not be performance efficient since we need to check all content artifacts to have at least one remote artifact.
  3. The solution does not guarantee that in the future we will have more than one remote artifact per content artifact (then considering on_delete=PROTECT would be an option as well: Should we keep using FK with on_delete=SET_NULL?).

Workflows

Remote artifacts are created during the sync procedure when the on-demand policy is used. When a user syncs a repository with two remotes pointing to the same destination, two remote artifacts per one content artifact are created:

pulp file remote create --name remo11 --url https://fixtures.pulpproject.org/file/PULP_MANIFEST --policy on_demand

pulp file remote create --name remo1 --url https://fixtures.pulpproject.org/file/PULP_MANIFEST --policy on_demand

pulp file repository create --name repo

pulp file repository sync --name repo --remote remo1

pulp file repository sync --name repo --remote remo11

So, in this case, removing one of the remotes will not cause any harm. Removing both of the remotes will cause the publication machinery to fail. The proposed protection does not allow users to delete both of the remotes.

References

The proposed solution comes from Problem: Deleting Remote deletes RemoteArtifacts - HackMD (1c). The PR that implements the solution is available at https://github.com/pulp/pulpcore/pull/2836.

I have said this before elsewhere: I have always been conceptually puzzled by the very loose coupling between remotes and repositories/repo versions in Pulp. In theory one could imagine a use case of creating several remotes, and attempting to sync all of them in turn into a single repository (which would be an argument for the loosely coupled status quo).

However, I think in practice users almost always imagine a tight coupling between a particular repo and a particular remote which is always used for syncing to that repo.

Right now, if I sync some remote into a repo the only place that records this link is the sync task, and once that task is deleted (because I need to clean up the gazillion old tasks clogging up my machine), the information is lost. Even if the task still exists it requires effort to retrieve the right task, just to check what remote was used for some repo version.

I am aware that it is possible for users to link up a repo with a remote, but this puts the onus on the user to choose a particular workflow to achieve something which reasonable users would expect to be the default behaviour, and big users (like Katello) don’t do this.

To me, the problem described in this thread is just another example consequence of this underlying conceptual muddle. I think remotes should be tightly coupled to repo versions created using the remote (via a remote field on the repo version model that contains the relevant remote href), and only be deleted once those repo versions no longer exist. (It is not expensive to keep a remote around in the DB).

3 Likes

I agree, coupling remotes with repositories should suffice. So, to remove the remotes, one should first remove the repositories/repository versions used for syncing.

However, there might be a problem with ACS. The remotes referenced from the ACS shall not be coupled.

1 Like

I just realized coupling remotes with repo versions (as I suggested) would only make sense if remotes were immutable. That is maybe something one might discuss for Pulp 4 or 5. :stuck_out_tongue_winking_eye:

(Isn’t mutability another problem for your remote artifacts use case? What happens to remote artifacts when the user alters the URL on the remote?)

So yes, for the immediate problem you describe just requiring remotes that will create remote artifacts to be linked to a repo and disallowing deleting as long as that link exists might work.

1 Like

Yes, indeed mutability is basically the same issue as deletability. But also coupling to the repo-version that was synced may not be enough, because (if not mirrored) this version may contain content of older versions that may be coupled to different remotes. So really the least common denominator is the RemoteArtifact. Both mutability and deletability could be circumvented by creating an immutable only garbage collected copy of a remote at sync time that is invisible to the user. Not sure if this snapshotting is a usable concept.

Sounds like it would almost be easier to just store the full URL on each and every remote artifact. How expensive would that duplication be?

We already do this https://github.com/pulp/pulpcore/blob/main/pulpcore/app/models/content.py#L675=

It’s not really a duplication because this URL points to the specific artifact whereas the remote URL points to the root of a repository. So the “remote” on a remote artifact is used as a configuration for the sync, and the URL present there isn’t helpful in terms of finding the artifact.

In the past I was looking at whether it might be possible to copy-on-write the remotes, but DRF isn’t designed for this at all. Also if you change the remote you might want all the remote artifacts to use the new configuration. This is another area where perhaps having the URL be part of what we currently call a Remote might not have been a good idea. How it is treated is a deviation in several ways. re: https://github.com/pulp/pulpcore/issues/2824

I still have a preference towards “fake deletion”, wherein we would secretly keep the remotes around but hide them from the API by setting a flag on the model. And then orphan cleanup could do a final garbage collection of any remotes for which no RemoteArtifacts exist.

Short term I’m OK with having a better error message. I don’t think we should close the issue though.