Remote fields vs sync options

quba42 · August 18, 2022, 2:09pm

Currently we either store what might be termed “sync options” on the remote, or they can be specified on the sync API endpoint. It is not entirely obvious to me why, for example, remote.policy (one of immediate, on_demand, streamed) is a remote option, while the pulp_rpm sync_policy (one of additive, mirror_complete, mirror_content_only) can only be specified at the time of the sync. Neither really tells us anything about the remote repository itself. I have also noticed that any sync options provided to the sync API endpoint at the time of the sync tend to be extremely ephemeral. When using Pulp via Katello (for example), it can be pretty difficult to find out for certain whether Katello really has passed some expected sync option to Pulp.

In summary I have two questions for discussion:

1. Do we have some organizing principle for what should be a remote field, and what should be a sync API endpoint option?
2. Do we want sync API endpoint options to be ephemeral, or should we systematically log or store them somewhere for debugging?

In some sense, question 2 also applies to remotes, since remotes can be changed after use, or the information about which remote was used for a particular sync can be lost when the sync task is deleted (as old tasks eventually tend to be).

x9c4 · August 18, 2022, 3:09pm

I don’t have an answer for you, but I recall hearing at some point that they should live in both places: persistent on the remote and overridable on a sync call.

If I had to vote, I would persist them all on the remote, and you would need to change the remote if you want to sync differently. The logging aspect could be handled by literal logging. I don’t see the need for this to enter the database.
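
To make the "persistent on the remote, overridable on a sync call" idea concrete, here is a minimal sketch in plain Python. It is not Pulp code: the function name `effective_sync_options`, the `REMOTE_DEFAULTS` dict, and the logger name are all made up for illustration, and only the option names and values come from this thread.

```python
import logging

logger = logging.getLogger("sync_options_sketch")  # hypothetical logger name

# Hypothetical values persisted on a remote (option names taken from this thread).
REMOTE_DEFAULTS = {"policy": "immediate", "sync_policy": "additive"}


def effective_sync_options(remote_options, call_overrides=None):
    """Return the options a sync would run with.

    Values persisted on the remote act as defaults; any option passed
    explicitly (non-None) on the sync call replaces the persisted value.
    """
    options = dict(remote_options)  # copy, so the persisted values stay untouched
    for key, value in (call_overrides or {}).items():
        if value is not None:
            options[key] = value
    # "Literal logging": record what the sync actually ran with.
    logger.info("sync starting with options: %s", sorted(options.items()))
    return options
```

Calling `effective_sync_options(REMOTE_DEFAULTS, {"sync_policy": "mirror_complete"})` would keep `policy` from the remote and override only `sync_policy`, while leaving a log line behind for later debugging.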

ggainey · August 19, 2022, 7:11pm

I don’t disagree that the differences between remote-policy and repository-sync-policy can be a little…opaque. I think of it like this:

The remote-policy describes how reliable you think your Remote is - i.e., if access is sporadic, anything using that Remote needs to get the content while it has a chance, so that Remote is immediate. If it’s reliable, we can use on_demand and only get the content when a client is actually asking for it.

sync-policy describes implications of the repository: “One of my Fedora35 repos mirrors exactly what Fedora has currently, including fedora-signed-metadata (mirror_complete); for a different one I want to use my own metadata-key (mirror_content_only); on a third I want to keep every RPM Fedora has ever published into the upstream (additive)”. They all use the same Remote attributes.
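
As a rough illustration of that split, the sketch below builds one sync request body per repository, all pointing at the same remote but with different `sync_policy` values. The repository names and the remote href are placeholders invented for this example, and the helper `build_sync_bodies` is hypothetical; only the `remote` and `sync_policy` field names and the three policy values come from the discussion.

```python
# Placeholder href for illustration only; a real one would come from the Pulp API.
REMOTE_HREF = "/pulp/api/v3/remotes/rpm/rpm/example-remote/"

# One sync_policy per repository; the repository names are made up.
REPO_SYNC_POLICIES = {
    "fedora35-exact": "mirror_complete",
    "fedora35-own-metadata": "mirror_content_only",
    "fedora35-archive": "additive",
}


def build_sync_bodies(remote_href, repo_policies):
    """Build one sync request body per repository, all sharing the same remote."""
    return {
        repo: {"remote": remote_href, "sync_policy": policy}
        for repo, policy in repo_policies.items()
    }


bodies = build_sync_bodies(REMOTE_HREF, REPO_SYNC_POLICIES)
```

Every body carries the same `remote`, so anything stored on the remote (such as its `policy`) is shared, while the per-repository intent travels with the sync call.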

Question 2 (“what happens if policy/sync-options are ephemeral? And if they’re persisted, what happens when I change them in the persisted object?”) remains, of course.

x9c4 · August 20, 2022, 8:09am

That is interesting, because on-demand vs. immediate to me is more like the tradeoff between disk-usage and network bandwidth. The fact that some remotes are so volatile they must be synced in immediate mode is rather a side note.

Funny question that pops into my mind here: Can we add yet another mode called “redirecting” where pulp will never download the content on its own at all?

quba42 · August 21, 2022, 5:28pm

Pretty sure this exists and is called streamed.

x9c4 · August 22, 2022, 6:54am

No. In streaming mode, Pulp downloads artifacts on the fly and serves them as the chunks come in, but does not save them to storage.
What I meant is that Pulp would issue a redirect to the original upstream.
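
The difference between the modes can be sketched as a small decision table. This is a plain-Python illustration, not how the Pulp content app is implemented; `ServeDecision` and `decide` are hypothetical names, and `"redirecting"` is the proposed mode from this thread, which does not exist in Pulp.

```python
from dataclasses import dataclass


@dataclass
class ServeDecision:
    download_from_upstream: bool  # does Pulp fetch the bytes itself at request time?
    save_to_storage: bool         # does Pulp keep the artifact afterwards?
    redirect_client: bool         # is the client sent to the upstream URL instead?


def decide(policy):
    """Describe what happens when a client requests content, per remote policy."""
    if policy == "immediate":
        # Content was already downloaded at sync time and is served from storage.
        return ServeDecision(False, False, False)
    if policy == "on_demand":
        # Downloaded on first request and kept afterwards.
        return ServeDecision(True, True, False)
    if policy == "streamed":
        # Downloaded on every request and passed through without being saved.
        return ServeDecision(True, False, False)
    if policy == "redirecting":
        # Hypothetical mode from this thread: Pulp never touches the bytes.
        return ServeDecision(False, False, True)
    raise ValueError(f"unknown policy: {policy!r}")
```

In this picture, "streamed" and "redirecting" both avoid storage, but only "redirecting" also avoids Pulp's own network traffic.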

quba42 · August 22, 2022, 7:02am

I am sure there might be reasons to prefer this to the “streamed” mode. It might be difficult for Pulp to provide any kind of guarantee about which file users are actually going to get this way, though. One would have to rely entirely on one’s packaging tools “to do the right thing” if the file is corrupted during download or similar. I guess we have that now with the download from the Pulp content app to the client though, so it might not be that important a point.

Maybe you should open a new thread/issue/discussion on this though, since we went off on a tangent here.

ggainey · August 22, 2022, 6:05pm

A good point, and very true. It also supports why this is on the Remote - “I want to save disk space when Repo-A syncs from Remote-A, but not when Repo-B syncs from Remote-A” doesn’t make any sense. “Remote-A is going to take 500GB of storage that I don’t have, it has to be on-demand no matter who uses it” does.

We have had some discussion around this in rpm-land - “I want to curate the content that Red Hat makes public, but I don’t want to deliver it from my Pulp instance, I want requests to go through to cdn.redhat.com, so clients end up streaming the content from whatever the closest CDN node is”. It is (as is everything) harder than it appears at first glance.

ggainey · August 23, 2022, 3:04pm

I think it would be useful to store “the values used” for repo and remote in the sync-task. I don’t see a convenient way to do that in task-attributes; using existing code, we’d maybe have to repurpose, say, a progress_report to hold that info (which is a little ugly, but would work). A field on Task, “invocation_attributes” or something, would be better - and would be generally useful, I think. Just some Thoughts In My Brain…
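
A minimal sketch of what such a field could hold, in plain Python rather than as a Pulp model. `SyncTaskRecord` and `record_sync` are hypothetical stand-ins, and `invocation_attributes` is only the field name floated above, not an existing Task attribute.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SyncTaskRecord:
    """Stand-in for a task row; `invocation_attributes` is the proposed field."""
    name: str
    invocation_attributes: dict = field(default_factory=dict)


def record_sync(remote_fields, sync_options):
    """Snapshot the remote fields and sync options as they were at dispatch time."""
    snapshot = {
        # Deep copies, so later edits to the remote do not alter the record.
        "remote": copy.deepcopy(remote_fields),
        "sync_options": copy.deepcopy(sync_options),
    }
    return SyncTaskRecord(name="sync", invocation_attributes=snapshot)


remote = {"url": "https://example.org/repo/", "policy": "immediate"}
task = record_sync(remote, {"sync_policy": "mirror_complete", "optimize": True})
remote["policy"] = "on_demand"  # the remote is changed after the sync ran
```

After the last line, `task.invocation_attributes["remote"]["policy"]` is still `"immediate"`, which is exactly the "values used" information that is otherwise lost once the remote is modified.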

quba42 · August 25, 2022, 7:38am

@ggainey Should I create a feature request for “Store sync options on the sync task”?

ggainey · August 25, 2022, 11:13am

That is a fine idea! Point back to this thread for details. Thanks!

quba42 · August 25, 2022, 12:00pm

github.com/pulp/pulpcore

Store sync options on the sync task

opened 11:56AM - 25 Aug 22 UTC

quba42

Feature Triage-Needed

**Is your feature request related to a problem? Please describe.**

This ticket follows on from a [discourse discussion](https://discourse.pulpproject.org/t/remote-fields-vs-sync-options/568), in particular, see [this comment](https://discourse.pulpproject.org/t/remote-fields-vs-sync-options/568/9?u=quba42).

In short: It can be very difficult to tell what sync options were originally sent with some sync API call after the fact. This is especially true for sync options that are only set at the time of the sync API call, e.g.: pulp_rpm's [mirror, sync_policy, skip_types, and optimize](https://docs.pulpproject.org/pulp_rpm/restapi.html#tag/Repositories:-Rpm/operation/repositories_rpm_rpm_sync) fields. However, the problem also applies to remote options, since while the remote used is already saved on the sync task, there is no easy way of knowing how the remote was modified after it was used.

Example use case: Let's say I suspect that Katello may not have sent the right sync options to Pulp, but I have no way of knowing what Pulp received without adding extra debug logging and "catching it in the act".

**Describe the solution you'd like**

Some or all of the sync options (up for discussion if they are all needed), could be stored on the sync task for later reference. This could help with debugging, but also just to understand what that particular sync task was all about.

**Describe alternatives you've considered**

Sync task just strikes me as the obvious place, but I am not actually attached to this design. If somebody has a better idea I am all ears. So long as the information can be easily retrieved.

Full disclosure: In pulp_deb we now store roughly the information requested by this issue in the [`RepositoryVersion.info` field](https://github.com/pulp/pulpcore/blob/609ad733b75198463291f7c45598f0af29d37d11/pulpcore/app/models/repository.py#L614). However, that field is not exposed to users, and this was never intended as anything other than `optimize` sync feature plumbing.