RFC: Separating Sync and Upload workflows in pulp_deb (and others?)

Background

The pulp_deb team had a brainstorming session this morning on how to improve package upload workflows in the plugin.

One of the main things we struggled with is the can of worms of edge cases and counterintuitive behaviour that opens up when a repository has both synchronized and uploaded content in it. E.g.: What happens to the uploaded content when a new sync uses the mirrored policy? (The uploaded content is chucked out of the new repo version.) How can optimized sync mode deal with these situations?

In the APT world these edge cases are especially abundant, because of how the metadata structures packages into “Release-Component-Architecture” groupings. E.g.: Should uploaded content go into the ReleaseComponents created via the sync, or into some kind of special “upload” component or similar? What if an upstream repository already has this component (think Pulp to Pulp syncs)?

One obvious conclusion is that as a matter of best practice, it is almost always best to use separate Pulp repos for synchronizing and uploading. We are now considering whether to take it further:

Proposal: Should we make it impossible to mix uploaded and synced content within a single repository?

The idea would be to have two types of repository in the plugin, one for upload and one for syncs so users simply can no longer mix these use cases.

Since this might be pretty restrictive for some expert users who know exactly what they are doing, we want to get as much input as possible before we embark down that route.

Questions we are seeking input on:

  1. As a pulp_deb user, can you think of any critical use cases that are incompatible with this approach?
  2. As a pulpcore developer, how hard would it be to create multiple repo types in a plugin, some with sync API endpoint, others without?
  3. As a pulp_rpm developer, how problematic is a mix of uploaded and synced content in a single pulp_rpm repo?

I know in pulp_container we have separate “mirror” and “push” repositories for roughly that reason. We keep coming back to the question of whether we can consolidate them.
And yes, when doing a mirror sync, you are expected to get exactly the upstream. As you say the best approach is to use different repositories (not necessarily repo-types) for different use patterns. Maybe we need to be a lot more verbose about the fact that the same content in multiple repositories does not need more storage.

So just off the top of my head:

  • Some users only sync content, using Pulp as a local mirror(ish)
  • Some users upload “their” content into their own repositories
  • Some users want to present “mixed” repos to their users. The main workflow in that case is “we have sync-repo-a and upload-repo-b, and we create enduser-repo-c by regularly copying all content from repo-a and repo-b into repo-c and distributing repo-c”.

I’m sure there are users who sync content from afar, push new rpms into the result, and distribute that. And, as you note, if they use “mirror_content” on that repo, “their” content is not going to be in the resulting repo-version.

None of this is “problematic”, per se. We haven’t had any pulp_rpm users (that I’m aware of, anyway) complain about having shot themselves in the foot this way. I mean, if you do that last “mirror into a mixed repo” sync, you don’t actually lose ‘your’ content - it’s just in the previous repo-version, and you can even “put it back” using modify-content.
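To make the “you don’t actually lose your content” point concrete, here is a toy model of repo versions as immutable sets of package names. This is only a sketch: `mirror_sync` and `add_content` are hypothetical helpers made up for illustration, not the Pulp API, and the package names are invented.

```python
from typing import FrozenSet, Iterable, List

RepoVersion = FrozenSet[str]


def mirror_sync(versions: List[RepoVersion], upstream: Iterable[str]) -> None:
    """Hypothetical mirror sync: the new version is exactly the upstream content."""
    versions.append(frozenset(upstream))


def add_content(versions: List[RepoVersion], units: Iterable[str]) -> None:
    """Hypothetical stand-in for modify-content: latest version plus the given units."""
    versions.append(versions[-1] | frozenset(units))


# Version 0 of a mixed repo: one synced package and one uploaded package.
versions: List[RepoVersion] = [frozenset({"upstream-pkg_1.0", "my-pkg_1.0"})]

# A mirror sync creates a new version that only contains upstream content.
mirror_sync(versions, {"upstream-pkg_1.0", "upstream-pkg_1.1"})

# The uploaded package is still present in the previous version...
dropped = versions[-2] - versions[-1]

# ...so it can be put back by creating yet another version.
add_content(versions, dropped)
```

The earlier versions are never modified here, which is the property that makes the uploaded content recoverable after a mirror sync.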

From pulpcore’s POV - without looking at the code, I don’t imagine it would be terribly hard to have repo-type-specific viewsets that override/disallow specific commands. It’d take some work, but I can’t think of anything in pulpcore that would prevent you from doing so.

After some more discussion and brainstorming, I now feel like it is better not to have completely separate repo types, but still to provide some guard rails/heavy nudging using a setting on the repository model:

Something like usage_type which can be one of Null, sync, upload, mixed. A new repo would start with Null by default, and the first sync or upload workflow would set it to sync or upload respectively. If the repo is set to upload for example, it will not allow you to use the sync API endpoint on it, unless you change the setting to mixed first.
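As a rough sketch of how such a guard could behave (not an actual pulp_deb implementation; `UsageType`, `UsageTypeError` and `check_workflow` are hypothetical names used only for illustration, and Null is modelled as `None`):

```python
from enum import Enum
from typing import Optional


class UsageType(str, Enum):
    SYNC = "sync"
    UPLOAD = "upload"
    MIXED = "mixed"


class UsageTypeError(Exception):
    """Raised when a workflow is not allowed for the repo's current usage_type."""


def check_workflow(current: Optional[UsageType], workflow: UsageType) -> UsageType:
    """Return the usage_type the repo should have after running `workflow`.

    `workflow` must be SYNC or UPLOAD. `current` is None for a fresh repo.
    """
    if workflow not in (UsageType.SYNC, UsageType.UPLOAD):
        raise ValueError("workflow must be 'sync' or 'upload'")
    if current is None:
        # The first sync or upload decides the type of a new repo.
        return workflow
    if current is UsageType.MIXED or current is workflow:
        # Explicitly opted-in repos and matching workflows pass unchanged.
        return current
    raise UsageTypeError(
        f"repository is of type '{current.value}'; set usage_type to 'mixed' "
        f"before running a {workflow.value} on it"
    )
```

With this shape, `check_workflow(None, UsageType.UPLOAD)` yields `UsageType.UPLOAD`, while a later `check_workflow(UsageType.UPLOAD, UsageType.SYNC)` raises until the user has switched the repo to `mixed`. The error message is also a natural place to point at the documentation.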

Expert users continue to have access to all possible workflows (after explicit opt-in), and we avoid the extra complexity of having multiple repo types. Non-expert users get some guard rails and get pointed at appropriate documentation before they can access the “expert” workflows.

Edit: This idea has obvious and immediate benefits for the verbatim publisher in particular: The user is trying to create a verbatim publication from an upload type repository? Better tell them that that does not make any sense! Verbatim publication of a mixed type repo? Better warn the user that any uploaded content will not be part of that publication.

Another possibility would be that uploaded packages are somehow marked in Pulp as having been uploaded.
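The verbatim publisher check could then be as small as the following sketch. The function name, the string values and the messages are all assumptions for illustration; nothing like this exists in pulp_deb today.

```python
from typing import Optional, Tuple


def verbatim_publication_check(usage_type: Optional[str]) -> Tuple[str, str]:
    """Return a (level, message) pair for a requested verbatim publication.

    `usage_type` is the hypothetical repository setting: None, "sync",
    "upload" or "mixed".
    """
    if usage_type == "upload":
        # Nothing was synced, so there is no upstream metadata to publish verbatim.
        return ("error", "verbatim publication makes no sense for an upload repository")
    if usage_type == "mixed":
        # Only synced content is covered by the upstream metadata.
        return ("warn", "uploaded content will not be part of the verbatim publication")
    return ("ok", "")
```

For example, `verbatim_publication_check("mixed")` returns a `"warn"` level together with the message about uploaded content being left out.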

Additionally, a new repo flag can be introduced like

  • keep manually uploaded packages during sync
  • remove manually uploaded packages during sync

This sounds like a vastly complicated classification. For example, how should content be classified if you uploaded it, but the very same package is also part of a subsequent sync? I think this creates way too many corner cases to handle intuitively for everyone.

Instead of a soft implementation of a repo-type (from none to mixed), I would represent them as separate repository-level feature flags. Even “no-feature” could be useful, marking a repository as read-only.

As soon as you start adding flags, you start adding combinatoric edge cases for “what happens when someone changes the flag”. Content is de-duplicated across repos, and the same sha256 can appear in pulp from multiple sources, sometimes uploaded, sometimes sync’d - how do you flag such content? Do you mark content that has been copied from one repo to another as special as well? What happens if a user sets a repo to “upload”, and then changes it to “sync”, and then back to Null? Does a repo created using modify-content or the rpm Copy command get its own flag/type? What code needs to run at state-changes, and what’s the difference in codepaths based on the current state?

If the only real use-case here is “we want some repos to not support the ‘sync’ command”, then that sounds like a subclass Thing to do, and the user would make that decision once. Or, the user could just not give such a repo an attached/default remote, and name it something like “foo-x86_64-UPLOAD_ONLY”.

At the end of the day, at some point we have to decide that the person responsible for content-management knows what they’re doing. There’s no set of flags that will help if that’s not true.

I wanted to mention our use case: each apt repo we have today gets content from two sources. The first is our old system: we’re continually syncing in content from it, since we plan to allow users to upload content to it during a deprecation period. The second source of packages is user uploads directly into Pulp. So ideally for us, pulp_deb would continue to support mixed content repos.

That said, we’re hoping to eventually retire the old system and stop syncing content into Pulp (maybe later in 2023). So maybe one option for us is to wait to upgrade pulp_deb until we do that. The request then would be that pulp_deb has a viable upgrade path for us to go from a mixed content repo to an upload-only repo.

Yes, but if this restriction was in place we would have just designed our syncing solution differently. Sync from old infra into unpublished internal repos, copy over everything into the publishable/uploadable repos. This repo design may have even made our migration lives easier.

One other possible use-case is that someone may want to initially populate a repo with a sync and then exclusively update after that. But again they could just use two separate repos and it’s not that much of a hardship.

I think I have no issue with this design change. I agree with @davidd that there obviously needs to be an upgrade path though.

Have you considered offloading this question to RBAC? The creator of the repo could decide what kind of permissions to grant, whether “sync” or “modify”.
