Remote fields vs sync options

Currently, what might be termed “sync options” are either stored on the remote or specified on the sync API endpoint. It is not entirely obvious to me why, for example, remote.policy (one of immediate, on_demand, streamed) is a remote option, while the pulp_rpm sync_policy (one of additive, mirror_complete, mirror_content_only) can only be specified at the time of the sync. Neither really tells us anything about the remote repository itself. I have also noticed that any sync options provided to the sync API endpoint at the time of the sync tend to be extremely ephemeral. When using Pulp via Katello (for example), it can be pretty difficult to find out for certain whether Katello really has passed some expected sync option to Pulp.

In summary I have two questions for discussion:

  1. Do we have some organizing principle for what should be a remote field, and what should be a sync API endpoint option?
  2. Do we want sync API endpoint options to be ephemeral, or should we systematically log or store them somewhere for debugging?

In some sense question 2 also applies to remotes, since a remote can be changed after use, and the information about which remote was used for a particular sync can be lost when the sync task is deleted (as old tasks eventually tend to be).



I don’t have an answer for you, but I recall hearing at some point that they should live in both places: persistent on the remote and overridable on a sync call.

If I had to vote, I would persist them all on the remote, so you would need to change the remote if you want to sync differently. The logging aspect could be handled by literal logging. I don’t see the need for this to enter the database.
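To make the “persistent on the remote, overridable on a sync call” idea concrete, here is a minimal sketch in plain Python. It is not Pulp code: `resolve_sync_options` and the logger name are hypothetical, and it assumes that an override set to `None` means “not provided”. It also shows the “literal logging” approach for recording which values were actually used.

```python
import logging
from typing import Any, Dict, Mapping

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("sync_options_sketch")


def resolve_sync_options(
    remote_defaults: Mapping[str, Any],
    call_overrides: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge options stored on the remote with options passed to one sync call.

    Values given on the sync call win; overrides that are None are treated
    as "not provided" (an assumption of this sketch).
    """
    effective = dict(remote_defaults)
    effective.update({k: v for k, v in call_overrides.items() if v is not None})
    # "Literal logging": record what was actually used, without a database.
    logger.info("effective sync options: %s", effective)
    return effective


options = resolve_sync_options(
    {"policy": "on_demand", "sync_policy": "additive"},
    {"sync_policy": "mirror_complete", "policy": None},
)
```

With these inputs, the stored `policy` is kept and only `sync_policy` is replaced for this one call.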

I don’t disagree that the differences between remote-policy and repository-sync-policy can be a little…opaque. I think of it like this:

The remote-policy describes how reliable you think your Remote is - i.e., if access is sporadic, anything using that Remote needs to get the content while it has a chance, so that Remote is immediate. If it’s reliable, we can use on_demand and only get the content when a client is actually asking for it.

The sync-policy describes what you intend the repository to be. “One of my Fedora35 repos mirrors exactly what Fedora has currently, including fedora-signed-metadata (mirror_complete); on a different one I want to use my own metadata-key (mirror_content_only); on a third I want to keep every RPM Fedora has ever published into the upstream (additive)”. They all use the same Remote attributes.

Question 2 remains, of course: what happens if policy/sync-options are ephemeral, and if they’re persisted, what happens when I change them in the persisted object? :slight_smile:

That is interesting, because on-demand vs. immediate to me is more like the tradeoff between disk-usage and network bandwidth. The fact that some remotes are so volatile they must be synced in immediate mode is rather a side note.

Funny question that pops into my mind here: Can we add yet another mode called “redirecting” where pulp will never download the content on its own at all?

Pretty sure this exists and is called streamed.

No, in fact, in streaming mode Pulp downloads artifacts on the fly and serves them as the chunks come in, but does not save them to storage.
What I meant is that Pulp would issue a redirect to the original upstream.
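A rough sketch of the distinction, in plain Python rather than actual Pulp content-app code. The function `plan_response` is hypothetical, “redirecting” is only the mode proposed in this thread, and the 404 for unsynced immediate content is an assumption made for illustration.

```python
from typing import Tuple


def plan_response(policy: str, upstream_url: str, in_storage: bool) -> Tuple[int, str]:
    """Return (HTTP status, action) for one client request. Illustrative only."""
    if in_storage:
        # Already downloaded, regardless of policy.
        return 200, "serve from storage"
    if policy == "immediate":
        # Assumption: immediate content is fetched at sync time, so a miss
        # here means the content was never synced.
        return 404, "content not available"
    if policy == "on_demand":
        return 200, "download from upstream, save to storage, serve"
    if policy == "streamed":
        return 200, "download from upstream, serve chunks as they arrive, do not save"
    if policy == "redirecting":
        # Hypothetical mode: Pulp never downloads the content itself.
        return 302, "redirect client to " + upstream_url
    raise ValueError("unknown policy: " + policy)


status, action = plan_response("redirecting", "https://upstream.example.com/pkg.rpm", False)
```

The key difference is in the last branch: in the streamed case the bytes still pass through Pulp, while in the redirecting case the client fetches them from the upstream directly.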

I am sure there might be reasons to prefer this to the “streamed” mode. It might be difficult for Pulp to provide any kind of guarantee about what file users are actually going to get this way, though. One would have to rely entirely on one’s packaging tools “to do the right thing” if the file is corrupted during download or similar. I guess we have that now with the download from the Pulp content app to the client though, so it might not be that important a point.

Maybe you should open a new thread/issue/discussion on this though, since we went off on a tangent here. :grinning_face_with_smiling_eyes:

A good point, and very true. It also supports why this is on the Remote - “I want to save disk space when Repo-A syncs from Remote-A, but not when Repo-B syncs from Remote-A” doesn’t make any sense. “Remote-A is going to take 500 GB of storage that I don’t have, it has to be on-demand no matter who uses it” does.

We have had some discussion around this in rpm-land - “I want to curate the content that Red Hat makes public, but I don’t want to deliver it from my Pulp instance, I want requests to go through to cdn.redhat.com, so clients end up streaming the content from whatever the closest CDN node is”. It is (as is everything) harder than it appears at first glance.

I think it would be useful to store “the values used” for repo and remote in the sync-task. I don’t see a convenient way to do that in task-attributes; using existing code, we’d maybe have to repurpose, say, a progress_report to hold that info (which is a little ugly, but would work). A field on Task, “invocation_attributes” or something, would be better - and would be generally useful, I think. Just some Thoughts In My Brain…
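As a sketch of what such a field could look like, here is a plain-Python illustration. It is not Pulp’s Task model: `TaskRecord` and `start_sync_task` are hypothetical names, and `invocation_attributes` is just the field name floated above. The point is that the task keeps a copy of the values used, so later edits to the remote do not rewrite history.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TaskRecord:
    """Hypothetical stand-in for a task, with the proposed extra field."""
    name: str
    invocation_attributes: Dict[str, Any] = field(default_factory=dict)


def start_sync_task(remote: Dict[str, Any], sync_options: Dict[str, Any]) -> TaskRecord:
    # Deep-copy so the snapshot is independent of later changes to the remote.
    snapshot = {
        "remote": copy.deepcopy(remote),
        "sync_options": copy.deepcopy(sync_options),
    }
    return TaskRecord(name="sync", invocation_attributes=snapshot)


remote = {"url": "https://upstream.example.com/repo/", "policy": "on_demand"}
task = start_sync_task(remote, {"sync_policy": "additive"})

# The remote is edited after the sync; the task still shows what was used.
remote["policy"] = "immediate"
```

This only helps for as long as the task itself is kept, so the snapshot would still be lost when old tasks are deleted.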

@ggainey Should I create a feature request for “Store sync options on the sync task”?

That is a fine idea! Point back to this thread for details. Thanks!
