Disable/enable downloads from remote for repositories that are not streamed

rfguimaraes · January 24, 2024, 1:03pm

Problem:
Pulp repositories on immediate or on_demand cannot be easily put into “offline” mode.
More specifically, Pulp will download every requested artifact whenever it can, and this behaviour does not seem to be easily configurable.
Expected outcome:
Pulp repositories in immediate or on_demand should always be able to be put in “offline” mode in which
Pulpcore version:
3.43.1
Pulp plugins installed and their versions:
pulp_deb: 3.1.1
pulp_file: 1.16.0
pulp_container: 2.17.0

Operating system - distribution and version:
GNU/Linux - Ubuntu 22.04 (jammy)
Other relevant data:
This is more of a question than an actual problem. Some tools similar to Pulp (e.g. JFrog’s Artifactory) have an easy way to put repositories (mirrors) in “offline” mode, which disables downloading from a remote. I wonder if Pulp has something similar.

This is particularly useful to prevent accidental, large downloads that could waste a large amount of resources (e.g. network, time). Also, as the online/offline setting is easy to toggle, temporarily putting the repository online for acquiring new artifacts and syncing is not a demanding task.

I am also open to suggestions on how to work around the lack of this feature, if it is not really implemented. For reference, I am setting up Pulp using the OCI images.

rfguimaraes · January 24, 2024, 1:50pm

I just noticed that, at least for pulp_deb, it is sufficient to delete the remote to prevent downloads, while still allowing downloaded artifacts to be reached. Moreover, to re-enable downloads from the remote it seems sufficient to reconstruct the remote and sync the repository again.

I tested this using the “maximum flexibility variation” of the synchronisation workflow. However, I am not sure whether a workaround based on this would have undesired side effects.

quba42 · January 24, 2024, 4:15pm

I understood your question as follows: “Is it possible to control when Pulp will download things from the upstream (remote) repo?” For immediate this is trivial:

If you are syncing using immediate, everything is downloaded “immediately” when you sync. When you are done syncing nothing is ever downloaded from the remote upstream repo again, until you sync again, so it is 100% under your control when you download things from the remote upstream repo.

When you sync on_demand, most things are not downloaded at sync time, and are instead downloaded whenever a client asks for that thing from the Pulp content app. The only way I know of to temporarily prevent that from happening is to stop serving the repo (remove your publication from any distributions), then re-associate them when you want to re-enable downloads. This does not strike me as convenient.

If you want to control when downloads happen I think your best bet is to use immediate and then sync only at those times where you don’t mind downloads happening.
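
For reference, the detach and re-attach workaround described above could be scripted roughly like this. It is only a sketch, not an official recipe: it assumes the distribution accepts a PATCH that sets its `publication` field, and `PULP_URL`, the helper name, and the example href are hypothetical placeholders. The request is built but not sent.

```python
import json
import urllib.request

PULP_URL = "http://localhost:8080"  # hypothetical host; adjust to your setup


def build_distribution_patch(distribution_href, publication_href=None):
    """Build a PATCH request that sets or clears a distribution's publication.

    With publication_href=None the publication is detached, so the content app
    no longer serves the repo; with an href it is attached again.
    """
    body = json.dumps({"publication": publication_href}).encode("utf-8")
    return urllib.request.Request(
        PULP_URL + distribution_href,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical href, for illustration only.
go_offline = build_distribution_patch("/pulp/api/v3/distributions/deb/apt/EXAMPLE/")
```

Actually sending the request (for example with `urllib.request.urlopen`) would also need credentials for the Pulp API. Note that while detached, already-downloaded artifacts are not served through that distribution either.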

rfguimaraes · January 25, 2024, 7:58am

Thank you! I will try some of these suggestions.

It is quite close to what I intended, but perhaps I should have rephrased it in terms of controlling “what” is downloaded. I would like to avoid downloading unnecessary packages (that is why I was focussing on the “on_demand” policy), either because they will never be requested or because they might be requested by accident (the latter is the main reason for putting mirrors in “offline” mode in my use case).

Indeed, for immediate it is straightforward to control the “download” time. The only problem is that, as far as I understood, the policy would download everything, which is something I want to avoid.

I see, I will try that and consider if it is a cleaner solution than deleting/recreating remotes as I had considered in my own reply.

ipanova · January 25, 2024, 11:41am

We don’t have such a toggle option in Pulp. Please file a feature request on our GitHub tracker.

That would still be only half a solution if you want to serve those artifacts that are downloaded and available locally.

quba42 · January 25, 2024, 11:30pm

Definitely more of a workaround, since there is no actual feature to “temporarily” block downloads.

@rfguimaraes I wonder if you could expand a bit more on what you mean by packages being “requested by accident”? Are these cases of clients requesting packages you don’t want them to request? Or perhaps cases of clients requesting packages at an inopportune time (when networking resources are scarce)? Or something else entirely? Do you have a concrete example? I feel like I am still not fully understanding the motivation for your use case.

rfguimaraes · January 26, 2024, 8:32am

Yes, it is this situation: clients requesting packages that we do not want to be downloaded. The idea is that we have a workflow that we can rely on and that will request only the packages we need and exactly those. I had planned to enable downloads from the remote while this workflow runs, and then put the repo offline.
As a more concrete example, suppose that my workflow runs and downloads a bunch of packages, creates a new publication, and updates the distribution.
While that workflow is not running, a client (which also uses the mirrors hosted in my Pulp installation) tries to install new packages, for example, all texlive packages, which have nothing to do with our work.
We would like to prevent this to save space. Putting the mirror offline would allow the clients to retrieve any artifact that has already been downloaded, but prevent new ones from being downloaded from the remote.

I also did some more research and simply periodically reclaiming disk space might be sufficient for my use case.
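
A periodic reclaim like the one mentioned above might be triggered from a scheduled script along these lines. This is a sketch under assumptions: it targets pulpcore's `reclaim_space` endpoint with its `repo_hrefs` and `repo_versions_keeplist` fields, while `PULP_URL`, the helper name, and the hrefs are hypothetical. The request is only constructed here.

```python
import json
import urllib.request

PULP_URL = "http://localhost:8080"  # hypothetical host; adjust to your setup


def build_reclaim_request(repo_hrefs, keep_versions=()):
    """Build a POST request asking Pulp to reclaim disk space for some repos.

    keep_versions optionally lists repository version hrefs whose artifacts
    should be kept on disk.
    """
    payload = {"repo_hrefs": list(repo_hrefs)}
    if keep_versions:
        payload["repo_versions_keeplist"] = list(keep_versions)
    return urllib.request.Request(
        PULP_URL + "/pulp/api/v3/repositories/reclaim_space/",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical repository href, for illustration only.
reclaim = build_reclaim_request(["/pulp/api/v3/repositories/deb/apt/EXAMPLE/"])
```

This frees space after the fact; it does not stop a client from triggering the download in the first place.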

pedro-psb · January 26, 2024, 1:48pm

Hello @rfguimaraes,
Do you think the idea of distributing a filtered repository fits this use case?

It looks to me that you only care for a subset of remote Content, so it makes sense to only include those in the repository you want to expose.

I believe there is not a very obvious way of filtering content of a repository, but I’ve learned recently that Katello uses the repository modify (add/remove) API (in pulp_rpm) to achieve this effect. In deb it would be this endpoint.

This is a long shot but I thought it was worth sharing.
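
To make the filtering idea concrete, here is a rough sketch of how a payload for the modify endpoint could be computed from an allowlist of package names. Assumptions: each package record carries `pulp_href` and `package` (name) fields, as in a pulp_deb package listing, and the function name and sample data are made up for illustration. It ignores any other content types a deb repository may contain.

```python
def build_filter_payload(packages, wanted_names):
    """Return a modify payload that removes every package not in wanted_names.

    packages: list of dicts with "pulp_href" and "package" keys (assumed shape
    of the package records listed from the repository version).
    """
    wanted = set(wanted_names)
    unwanted = [p["pulp_href"] for p in packages if p["package"] not in wanted]
    return {"remove_content_units": unwanted}


# Made-up sample records, for illustration only.
sample = [
    {"pulp_href": "/pulp/api/v3/content/deb/packages/A/", "package": "curl"},
    {"pulp_href": "/pulp/api/v3/content/deb/packages/B/", "package": "texlive-full"},
]
payload = build_filter_payload(sample, ["curl"])
```

The resulting dictionary would be sent as the JSON body of a POST to the repository's `modify/` endpoint, which creates a new repository version without the unwanted units.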

rfguimaraes · January 26, 2024, 1:52pm

Thanks for this other suggestion @pedro-psb.

This might indeed fit our use case. I will see how that fares, considering that we still have to precisely map which packages we actually use.

ipanova · January 29, 2024, 1:41pm

@rfguimaraes You should take advantage of the promotion workflow. It nicely allows you to decide what you want to expose. Create two repositories: the first repo holds the content your team relies on, and the second holds the content you want to export to the end users. You will need to implement a periodic copy from repo1 to repo2 and use the latest version as base_version (see link from @pedro-psb). This way all packages will be copied over.
For repo2 you can set auto-publish; this way you won’t even need to manually update the distribution.
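
The periodic copy step could look roughly like the sketch below. It assumes that repo2's `modify/` endpoint accepts a `base_version` pointing at repo1's latest version (as suggested above) and that this href is read from repo1's `latest_version_href` field; `PULP_URL`, the helper name, and the hrefs are hypothetical. The request is built but not sent.

```python
import json
import urllib.request

PULP_URL = "http://localhost:8080"  # hypothetical host; adjust to your setup


def build_promotion_request(target_repo_href, source_version_href):
    """Build a POST to the target repo's modify endpoint.

    The new version of the target repo starts from the content of
    source_version_href (for example repo1's latest version).
    """
    body = json.dumps({"base_version": source_version_href}).encode("utf-8")
    return urllib.request.Request(
        PULP_URL + target_repo_href + "modify/",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical hrefs, for illustration only.
promote = build_promotion_request(
    "/pulp/api/v3/repositories/deb/apt/REPO2/",
    "/pulp/api/v3/repositories/deb/apt/REPO1/versions/5/",
)
```

Running this on a schedule (cron, CI job) after the trusted workflow has finished would keep repo2 in step with repo1 while clients only ever see repo2.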

rfguimaraes · January 30, 2024, 12:01pm

Thanks! This approach seems quite neat.