De-duping Pulp Content Views

Hi there,

I have a question about how Pulp content looks when the same content-view is promoted to multiple lifecycle environments.

I have the following LEs: Eng>RC>UAT>Dev>Test>PAT>Prod (quite a few)

I create a new version of my cv-core, publish it and promote it into each of those LEs.

The content is unchanged in each promotion, but it looks to me that Pulp is generating a new copy of cv-core in each LE.

First up, am I right about this? It certainly looks that way from the server workload when I compare promotions into just Eng versus promotions into all LEs.

Secondly (depending on the answer to Q1), is it not possible to link the version of cv-core in, say, UAT back to the initial version promoted to Eng? A copy-on-write type of thing, which would save having to create an entire new copy of an identical CV.

I’m asking because my content is synced from a central Pulp in Katello out to 26 global smart-proxies, across the 7 LEs mentioned above. The workload on each of the smart-proxies is massive during each release cycle. We wondered if there is anything we can do to tune this to help the smart-proxies out.

Thanks

Duncan

Hey Duncan,

This is more a question for “how katello manages content-views” - Pulp doesn’t “know anything” about how Katello organizes lifecycle-environments or content-views. We can weigh in on pulp’s pieces.

Katello’s CVs are all individual repositories in Pulp. The content itself is de-duplicated on the “home” Katello, but from a smart-proxy’s POV, it has no way of knowing that. So when Pulp on a SP is told “sync these 27 repositories”, it pulls the content down. It doesn’t know the content is “the same” in each repo until it has the binary blobs in hand and has successfully checksummed them. Only then can it reliably say “oh, I already have this one”.
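To make the "download first, de-duplicate afterwards" behaviour described above concrete, here is a toy sketch. It is not Pulp code: `ContentStore`, the repo names, and the package payloads are all made up for illustration, and it only models the idea that identical bytes end up stored once even though every repository sync fetches them.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: one blob per unique SHA-256 digest."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes, each unique blob stored once
        self.repos = {}      # repo name -> set of digests it references
        self.downloads = 0   # how many times bytes were fetched

    def sync(self, repo_name, remote_blobs):
        """Fetch every blob for one repository, de-duplicating after the fact."""
        refs = self.repos.setdefault(repo_name, set())
        for data in remote_blobs:
            self.downloads += 1                        # bytes are pulled first
            digest = hashlib.sha256(data).hexdigest()  # then checksummed
            self.blobs.setdefault(digest, data)        # stored only if new
            refs.add(digest)


packages = [b"pkg-a-1.0", b"pkg-b-2.3"]   # hypothetical package payloads
store = ContentStore()
for env in ["Eng", "RC", "UAT"]:
    store.sync(f"cv-core-{env}", packages)

print(store.downloads, len(store.blobs))
```

In this sketch three repos with two identical packages each cause six fetches but only two stored blobs, which is the shape of the workload being discussed.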

Making that process “smarter” would involve handshakes between the smart proxies and katello, and a different way of defining what a CV is inside of katello, which would be a big change, I think.

I don’t know if this helps any, but you may get more focused answers for your specific scenario from TheForeman’s Discourse, which includes the Katello Gang.


Definitely a Katello question that would be more at home on the Foreman discourse.

I do know something about how Katello handles promoting the same content view version to multiple lifecycle environments, so maybe I can just answer the question:

Each new content view version creates a new Pulp repository and copies content from the library instance Pulp repository to the content view version Pulp repository. This is necessary to provide features like content view filters and roll-back to old content view versions.

However, for lifecycle environments it is different. Katello does not create a new Pulp repository for those. It merely creates a new Pulp distribution (which is essentially just a reserved path on the Pulp content app) and then re-uses the publication/repository it already created for the content view version for that new Pulp distribution. That way, if you have 7 different lifecycle environments that all have the same content view version promoted to them, you will see 7 different repository URLs on your Katello instance, but all 7 of them will be serving the exact identical Pulp objects. So this is not duplicated in Pulp.
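As a rough mental model of the paragraph above (not Katello or Pulp code), promotion can be pictured as adding another path that points at an object which already exists. The `Publication` class, the `promote` helper, and the path layout below are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """Stands in for the publication created for one content view version."""
    cv_name: str
    version: str


def promote(distributions, publication, environment):
    """Promotion adds a path that points at the existing publication."""
    path = f"{environment}/{publication.cv_name}"   # hypothetical path layout
    distributions[path] = publication               # no copy is made
    return path


environments = ["Eng", "RC", "UAT", "Dev", "Test", "PAT", "Prod"]
cv_core_v1 = Publication("cv-core", "1.0")
distributions = {}
for env in environments:
    promote(distributions, cv_core_v1, env)

# Seven paths, but every one refers to the same underlying object.
unique_publications = {id(pub) for pub in distributions.values()}
print(len(distributions), len(unique_publications))
```

Seven entries in `distributions` map onto a single `Publication`, which mirrors "7 different repository URLs" serving identical Pulp objects.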

However, like @ggainey said, the smart proxies don’t know that all of those URLs are serving the same content. So if you sync 7 lifecycle environments to your smart proxies, that is still going to be 7 separate syncs into 7 different Pulp repos on the smart proxy, for each repo that is in the 7 lifecycle environments, even if all of them are in exactly the same state.
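A back-of-the-envelope way to see how this multiplies, using the 26 smart proxies and 7 lifecycle environments from the original post. The number of repos per content view is not given in this thread, so `REPOS_PER_CV` is a made-up value, and the sketch assumes every proxy syncs every environment.

```python
def sync_tasks(smart_proxies, environments, repos_per_cv):
    """Count repository syncs if every proxy syncs every repo in every LE."""
    return smart_proxies * environments * repos_per_cv


REPOS_PER_CV = 10   # hypothetical; not stated in the thread

all_envs = sync_tasks(26, 7, REPOS_PER_CV)   # promoted to all 7 LEs
eng_only = sync_tasks(26, 1, REPOS_PER_CV)   # promoted to Eng only
print(all_envs, eng_only)
```

Whatever the real repo count is, promoting to all seven environments gives seven times as many repository syncs as promoting to Eng alone under these assumptions.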

One question I have, is why you have 7 different environments if they all have the same content view version promoted to them anyway? Or did I misunderstand that point?


Thanks for both answers. Makes more sense to me now. I thought Pulp was more of a content-view manager, but you straightened out my thinking.

We are loading some historic content (the last 3 monthly releases) into the repos at the moment. Hence promoting the same CV through all environments. So it’s definitely not a regular thing. Normal practice is to do quite a lot of promotions of new CV to Eng while we work on a release, then promote a specific version to RC & UAT for testing. If that is successful, we promote to the end-users environments which are Dev & Test initially, then a week later to PAT and Prod.

Thanks for sorting me out

On a Pulp-related note then (something I keep getting asked): is it possible to get Pulp to only put the latest version of packages into a repo?

So if I’ve got 12 historic versions of a package in a repo, could I flag the repo as “latest-only” and have all the older versions of packages removed? Not to save space on the server, but clients build up a lot of cruft from dnf caches. Would be nice to keep those numbers as low as feasible.

For RPM packages (yum content in Katello, pulp_rpm Pulp plugin) this feature exists on the Pulp side.

Katello will only let you see and use it if you set the “Mirroring Policy” for your yum type Katello repo to “Additive”. Once you have done that you will see the “Retain package versions” option on the Katello repository page. You can set this to 1 for “only keep one (latest version) of each package”.

This feature does not exist for .deb packages.
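To illustrate what a "retain N versions" setting means in principle, here is a small sketch. It is not how pulp_rpm implements the feature: `retain_latest` is a hypothetical helper, and versions are plain integer tuples, whereas real RPM comparison (epoch, version, release) is more involved.

```python
from collections import defaultdict


def retain_latest(packages, keep=1):
    """Keep the `keep` highest versions of each package name.

    `packages` is a list of (name, version_tuple) pairs. Comparing integer
    tuples is a simplification of real RPM version ordering.
    """
    by_name = defaultdict(list)
    for name, version in packages:
        by_name[name].append(version)

    kept = []
    for name in sorted(by_name):
        newest_first = sorted(by_name[name], reverse=True)
        for version in newest_first[:keep]:
            kept.append((name, version))
    return kept


# Hypothetical repo contents with several historic versions per package.
repo = [
    ("bash", (5, 1, 8)),
    ("bash", (5, 2, 15)),
    ("curl", (7, 76, 1)),
    ("bash", (5, 1, 16)),
    ("curl", (8, 0, 1)),
]
print(retain_latest(repo, keep=1))
```

With `keep=1` only the newest `bash` and `curl` entries survive, which corresponds to setting "Retain package versions" to 1.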
