Publication vs composing

I’d like to discuss whether a Publication in Pulp terms can be mapped onto the Composing action in Fedora, CentOS and RHEL terminology.

I am looking for possibilities to

  1. decouple the compose process from the rpm build system,
  2. have compose-related artifacts organized in a meaningful way, so that one can use higher-level methods for discovery, search, diff, rebase and merge them, rather than work with the hardcoded directory structure.

But let me set the scene first.

Context

How Pulp Publication works (as far as I understood)

There is a certain set of content units. We “put” (link) content units into a Repository.
Then we create a Repository Version = an immutable snapshot of the state of the Repository at a certain point in time.
Then we apply Publication to it.

Publication is a procedure which takes a Repository Version as input and generates additional artifacts from it (rpm metadata, directory hierarchy, html pages, …).
These artifacts are then stored in Pulp as separate Content Units (?)

And then we may expose the Publication via a Distribution object.
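To check that I have the chain right, here is a minimal sketch of it in plain Python. All class and function names are my own illustration and not Pulp's actual models or API; the explicit `snapshot()` step and the generated file names are assumptions taken from my description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentUnit:
    name: str


@dataclass(frozen=True)
class RepositoryVersion:
    number: int
    content: frozenset  # immutable snapshot of the linked content units


@dataclass(frozen=True)
class Publication:
    repository_version: RepositoryVersion  # the input it was generated from
    generated_files: tuple  # relative paths of the generated layout


@dataclass(frozen=True)
class Distribution:
    base_path: str
    publication: Publication


@dataclass
class Repository:
    name: str
    content: set = field(default_factory=set)
    versions: list = field(default_factory=list)

    def add(self, unit: ContentUnit) -> None:
        self.content.add(unit)

    def snapshot(self) -> RepositoryVersion:
        # Hypothetical explicit snapshot step; numbering starts at 1 here.
        version = RepositoryVersion(len(self.versions) + 1, frozenset(self.content))
        self.versions.append(version)
        return version


def publish(version: RepositoryVersion) -> Publication:
    # Hypothetical layout: one path per unit plus one metadata file.
    packages = tuple(sorted(f"Packages/{u.name}" for u in version.content))
    return Publication(version, packages + ("repodata/repomd.xml",))


repo = Repository("demo")
repo.add(ContentUnit("foo-1.0-1.x86_64.rpm"))
v1 = repo.snapshot()
repo.add(ContentUnit("bar-2.0-1.x86_64.rpm"))
v2 = repo.snapshot()

pub = publish(v1)
dist = Distribution("demo/latest", pub)
```

Adding `bar` after the first snapshot changes only `v2`; `v1` and the Publication built from it stay as they were.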

How Fedora/CentOS/RHEL compose works

We have a pool of rpm builds represented by a Koji tag (an rpm build is a group of several rpms built from a single SRPM, think subpackages and arch-dependent packages).
We make a snapshot of this pool at a certain time (tracked by a Koji event object) and then we run a compose.

Compose has multiple phases:

  • It takes the pool of rpm builds and splits packages into arch-specific repositories.
  • It takes the repository of all packages for this architecture and then applies filtering and dependency resolution to produce the layered repository structure, so that the next layer depends only on the previous one.
  • It then triggers image builds and container builds which produce iso, qcow, … artifacts.
  • It uploads those artifacts back to Koji Hub.
  • And then there is a completely separate procedure to take data from Koji Hub and publish it to mirrors for distribution.
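As a rough illustration of the first phase only, here is a sketch of splitting a pool of rpm builds into arch-specific repositories. The `Rpm` and `RpmBuild` types are hypothetical, and the rule that `noarch` packages go into every architecture is a simplification I am assuming; the real tooling handles more cases (source packages, multilib and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rpm:
    name: str
    arch: str  # e.g. "x86_64", "aarch64", "noarch"


@dataclass(frozen=True)
class RpmBuild:
    srpm: str
    rpms: tuple  # all binary rpms produced from this SRPM


def split_by_arch(builds, arches):
    """Return {arch: sorted package names} for the requested arches."""
    repos = {arch: [] for arch in arches}
    for build in builds:
        for rpm in build.rpms:
            if rpm.arch == "noarch":
                # Assumption: noarch packages are copied into every arch repo.
                for arch in arches:
                    repos[arch].append(rpm.name)
            elif rpm.arch in repos:
                repos[rpm.arch].append(rpm.name)
    return {arch: sorted(names) for arch, names in repos.items()}


builds = [
    RpmBuild(
        "foo-1.0-1.src.rpm",
        (Rpm("foo", "x86_64"), Rpm("foo", "aarch64"), Rpm("foo-doc", "noarch")),
    ),
    RpmBuild("bar-2.0-1.src.rpm", (Rpm("bar", "x86_64"),)),
]
repos = split_by_arch(builds, ["x86_64", "aarch64"])
```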

Ideas

1. Compose as Publication

  • RPM build is a Content Unit
  • Koji tag is a Repository
  • Koji tag per event is a Repository Version
  • Compose is a Publication
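To make the "Koji tag per event is a Repository Version" line concrete, here is a toy sketch that replays a tag history up to a given event and returns the resulting immutable set of builds. The `(event_id, action, build)` tuples are my own simplified stand-in and not Koji's actual schema, and the build names are invented.

```python
def builds_at_event(history, event_id):
    """Replay tag/untag actions up to and including event_id."""
    tagged = set()
    for event, action, build in sorted(history):
        if event > event_id:
            break
        if action == "tag":
            tagged.add(build)
        elif action == "untag":
            tagged.discard(build)
    # A frozenset plays the role of the immutable Repository Version.
    return frozenset(tagged)


history = [
    (100, "tag", "bash-5.1-1"),
    (105, "tag", "glibc-2.34-1"),
    (110, "untag", "bash-5.1-1"),
    (120, "tag", "bash-5.1-2"),
]

at_105 = builds_at_event(history, 105)
at_115 = builds_at_event(history, 115)
at_120 = builds_at_event(history, 120)
```

Each event id then identifies one snapshot of the tag, which is the property I would like to map onto Repository Versions.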

The concern here is that Publication becomes a very heavy step: it generates metadata, it also calls external services to run image builds, and it needs to own new types of artifacts.

Question: Can it really do that? Are there limitations to what Publication can do and which content units it can contain? Can it run several tasks in parallel?

2. Repository groups? Links?

It can also be that Publication only covers the first two phases of pungi, i.e. generating the repository structure, while building of the secondary artifacts would be orchestrated by an external service.

This then leads to the other question, see https://github.com/pulp/pulpcore/issues/3710

When I am producing the secondary artifacts from a certain publication, I want to preserve the link between those artifacts and the Repository Version and Publication which I used as inputs.

So while I can set up a separate dedicated Repository to store the compose artifacts, I lose the “native” connection between them, and I will have to add it back in some custom way (metadata files?).

So if Compose doesn’t work in the Publication context, can it get the relevant abstraction on its own?

Composing as a way to produce a linked repository

I think it can be considered a common pattern in CI that we have two different kinds of artifacts:

  1. Primary artifacts: the pool of rpm builds in the Fedora case.
  2. Secondary or derivative artifacts: things which we produce from the primaries (rpm repos and images in our example).

The repository of primary artifacts represents the place where the change happens: we upload new content units to it directly.

The derivative artifacts are produced (composed) from snapshots of primary artifacts.

The key issue is that when I am validating the update of the primary artifact and decide whether or not I would like to promote it, I need to use derivative artifacts to make a decision.

Therefore it is critical to maintain the two-directional link between a derivative artifact and the primary one.

A Compose object could be a new abstraction representing a “Publication on steroids”: it has a proper Repository object associated with it, and it maintains the link between a Primary Repository Version and the derivative Repository Version.
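A minimal sketch of the two-directional link I have in mind, in plain Python. `VersionRef` and `ComposeIndex` are hypothetical names for illustration; nothing like this exists in Pulp as far as I know, and the repository names are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionRef:
    repository: str
    number: int


class ComposeIndex:
    """Keeps both directions: derivative -> primary and primary -> derivatives."""

    def __init__(self):
        self._primary_of = {}
        self._derivatives_of = defaultdict(set)

    def record(self, primary: VersionRef, derivative: VersionRef) -> None:
        existing = self._primary_of.get(derivative)
        if existing is not None and existing != primary:
            # A derivative version is produced from exactly one primary snapshot.
            raise ValueError(f"{derivative} is already linked to {existing}")
        self._primary_of[derivative] = primary
        self._derivatives_of[primary].add(derivative)

    def primary_of(self, derivative: VersionRef) -> VersionRef:
        return self._primary_of[derivative]

    def derivatives_of(self, primary: VersionRef) -> frozenset:
        return frozenset(self._derivatives_of.get(primary, ()))


index = ComposeIndex()
primary = VersionRef("fedora-builds", 7)
repos_v = VersionRef("fedora-repos", 3)
images_v = VersionRef("fedora-images", 2)
index.record(primary, repos_v)
index.record(primary, images_v)
```

When validating an update I would start from `primary`, look up `derivatives_of(primary)` to find what to test, and go back with `primary_of(...)` to decide what to promote.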

What do you think?

Hello @bookwar, thanks for this discussion! I’m writing up a response, it has just taken me a bit to gather my thoughts :slight_smile:

How Pulp Publication works (as far as i understood)

Yes, correct (mostly). I will add that the HTML index pages are handled at request time rather than pre-generated, but they could in theory be pre-generated using the same mechanism if we wanted to.

Artifacts and content units are related but independent, but for the sake of talking about RPM packages specifically, the relationship is 1-to-1. So you have the metadata associated with one RPM package, and one artifact (which is a handle to a checksum-addressed file) associated with that.

How Fedora/CentOS/RHEL compose works

Sounds very similar to Flat Manager [0], which makes sense, since that is also used for handling “builds” of related artifacts.

[0] Introducing flat-manager – Alexander Larsson

Can you expand a bit on the 2nd point about layered repository structures? Also, where can I find out more about the full schema / metadata available for a Koji “event” or “rpm build” or “tag”?

  1. Compose as Publication

I think the most difficult part would be resolving the impedance mismatch around singular RPMs vs RPM builds. The model as it currently exists is centered around singular RPM packages and retrofitting the concept of an RPM build on top of that might be challenging in a similar way to how modules are challenging.

There may be a slight misunderstanding though, Publications are basically agnostic to the content unit type. It gets boiled down to a collection of PublishedArtifact and PublishedMetadata, which are effectively just a pointer to the object store plus a relative path within the repo - not so different from “hardcoded directory structure”. It doesn’t know anything about content types. Repositories (/ repo versions) do, but they don’t know about the final layout.
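To illustrate what "a pointer to the object store plus a relative path" boils down to, here is a rough sketch in plain Python. `PublishedEntry` is a hypothetical stand-in, not the real PublishedArtifact or PublishedMetadata model, and the `sha256:` key format is an assumption for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishedEntry:
    relative_path: str  # where the file appears inside the published repo
    storage_key: str    # checksum-addressed pointer into the object store


def storage_key(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_layout(files: dict) -> dict:
    """Map relative path -> entry; no content type information is kept."""
    return {
        path: PublishedEntry(path, storage_key(data))
        for path, data in files.items()
    }


layout = build_layout(
    {
        "Packages/f/foo-1.0-1.x86_64.rpm": b"rpm payload",
        "mirror/foo-1.0-1.x86_64.rpm": b"rpm payload",
        "repodata/repomd.xml": b"<repomd/>",
    }
)
```

The two `foo` paths point at the same stored object, and nothing in the entry says whether it is a package, metadata or an image; that is the context that is gone once the Publication is created.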

In terms of what a publication can do, I suppose it can do anything really, but I’m not sure we would want to have the code used to support that use case running for all users, so probably we would want to chain independently written tasks together instead. But, the fact that the context is erased during creation of a Publication might be a problem for you? It would have to be pieced back together and you’d be kind of back at square one.

2 Repository groups? Links?

Repository groups are definitely a good idea and something that we might want for other reasons too (see https://github.com/pulp/pulpcore/issues/1969).

But the how is still a very open question. Creating tie-ins between plugins that ought to be independent (and are independently versioned relative to each other and pulpcore) is a challenge. I think as Matthias suggested it could be done with stringly-typed labels on objects, but I’m not sure about stronger linkages.
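For a sense of what the label approach could look like, here is a small sketch with plain string key/value labels. The `Labeled` class, the `compose.source_version` key and the href strings are all invented for illustration and are not an existing convention.

```python
from dataclasses import dataclass, field


@dataclass
class Labeled:
    href: str
    labels: dict = field(default_factory=dict)  # str -> str only


def find_by_label(objects, key, value):
    """Return the objects whose label `key` equals `value` exactly."""
    return [obj for obj in objects if obj.labels.get(key) == value]


primary_version = "/repositories/builds/versions/7/"

objects = [
    Labeled("/repositories/repos/versions/3/", {"compose.source_version": primary_version}),
    Labeled("/repositories/images/versions/2/", {"compose.source_version": primary_version}),
    Labeled("/repositories/unrelated/versions/1/"),
]

derived = find_by_label(objects, "compose.source_version", primary_version)
```

The link is only as strong as the string: nothing stops the label from pointing at a version that no longer exists, which is the weakness compared to a real foreign-key style relation.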

Compose object could be a new abstraction which represents the “Publication on steroids”, which has a proper Repository object associated to it. And which maintains the link between a Primary Repository Version and the derivative Repository Version.

Well the publication does know what repository version it came from. As for maintaining a link between the primary and derivative repository version, the main thing I’m struggling with is that we (probably) expect a full churn on those derivative artifacts every rebuild, which basically sounds like a publication. But - we explicitly want to maintain extra context about that stuff, so putting it directly in a publication doesn’t help. It feels a bit awkward either way.

It kinda feels like if we want to be part of a general build system we ought to shift towards an architecture similar-ish to [0]

@bookwar thoughts?

How Fedora/CentOS/RHEL compose works

Sounds very similar to Flat Manager [0], which makes sense, since that is also used for handling “builds” of related artifacts.

[0] Introducing flat-manager – Alexander Larsson

Indeed, this seems to align with the pattern of primary and derivative artifacts I have in mind. I think we see it often but most of the time it is tied to a specific implementation of the artifact and the deliverable. Ideally I would like to see it adopted as a more generic concept so that it provides flexibility around the implementation details of the artifacts used as sources and as deliverables.

Like, what if one day we decide to build snaps and flatpaks from the same source? Or if we would like to generate Image Builder images from rpms and python wheels?

Can you expand a bit on the 2nd point about layered repository structures?

I was referring to the separation of the flat pool of all rpm builds into BaseOS/, AppStream/ and CRB/ components as it is seen in Index of /development/latest-CentOS-Stream/compose

BaseOS part of the compose has all its runtime dependencies included, while AppStream repo may depend on packages from BaseOS. HighAvailability component can depend on BaseOS and AppStream, and so on.

So in the end compose produces not a single repository but all of these components for each of the supported architectures.
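The layering rule can be expressed as a simple check: every runtime dependency of a package must live in its own layer or in an earlier one. The sketch below is only an illustration in plain Python; the function and the dependency data are made up for the example and this is not how pungi implements it.

```python
def check_layers(layers, requires):
    """layers: ordered list of (name, set of packages).
    requires: dict mapping a package to the set of packages it needs.
    Returns (layer, package, missing dependency) tuples for violations."""
    available = set()
    problems = []
    for name, packages in layers:
        # A layer may use itself and everything from the previous layers.
        available |= packages
        for pkg in sorted(packages):
            for dep in sorted(requires.get(pkg, set()) - available):
                problems.append((name, pkg, dep))
    return problems


layers = [
    ("BaseOS", {"glibc", "bash"}),
    ("AppStream", {"httpd"}),
]

ok = check_layers(layers, {"bash": {"glibc"}, "httpd": {"glibc"}})
broken = check_layers(layers, {"bash": {"httpd"}, "httpd": {"glibc"}})
```

In the second call `bash` in BaseOS needs `httpd`, which only appears in AppStream, so the check reports it as a violation of the layering.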

This is the outcome of the GATHER phase of the compose, see Tree - pungi - Pagure.io

Also, where can I find out more about the full schema / metadata available for a Koji “event” or “rpm build” or “tag”?

Full disclaimer: I am not an expert in Koji internals, so for a deep dive we probably need to reach out to some other folks.

The docs explain it on a high level Koji HOWTO — Koji 1.32.1 documentation

For the very basics you can query the Koji API directly and get the JSON data associated with the object. See Playing with public API of Stream Koji · GitHub

There are also the sources: Tree - koji - Pagure.io. But be warned, it is a single file and it doesn’t seem to have an abstract “data model”; it mostly operates on records in the database.


  1. Compose as Publication

I think the most difficult part would be resolving the impedance mismatch around singular RPMs vs RPM builds. The model as it currently exists is centered around singular RPM packages and retrofitting the concept of an RPM build on top of that might be challenging in a similar way to how modules are challenging.

If I understand correctly, the rpm model is implemented on the plugin side, not in pulpcore?

My idea was not to reuse the current rpm plugin but to do a new plugin with a new type of the content unit specifically for the rpm builds. And then if it works, the rpm package type could be implemented as rpm build with a single binary if needed, but could also be just left alone.

There may be a slight misunderstanding though, Publications are basically agnostic to the content unit type. It gets boiled down to a collection of PublishedArtifact and PublishedMetadata, which are effectively just a pointer to the object store plus a relative path within the repo - not so different from “hardcoded directory structure”. It doesn’t know anything about content types. Repositories (/ repo versions) do, but they don’t know about the final layout.

In terms of what a publication can do, I suppose it can do anything really, but I’m not sure we would want to have the code used to support that use case running for all users, so probably we would want to chain independently written tasks together instead. But, the fact that the context is erased during creation of a Publication might be a problem for you? It would have to be pieced back together and you’d be kind of back at square one.

I am not sure what kind of context you mean? My assumption was that I can have multiple Publications for one Repository. So Publications are tied to Repository Versions and are immutable in the same sense as RepoVersion is immutable.

I can create a new Publication for a new RepoVersion, but I cannot modify a Publication which has already been created. Is that correct?

2 Repository groups? Links?

Repository groups are definitely a good idea and something that we might want for other reasons too (see https://github.com/pulp/pulpcore/issues/1969).

But the how is still a very open question. Creating tie-ins between plugins that ought to be independent (and are independently versioned relative to each other and pulpcore) is a challenge. I think as Matthias suggested it could be done with stringly-typed labels on objects, but I’m not sure about stronger linkages.

Compose object could be a new abstraction which represents the “Publication on steroids”, which has a proper Repository object associated to it. And which maintains the link between a Primary Repository Version and the derivative Repository Version.

Well the publication does know what repository version it came from. As for maintaining a link between the primary and derivative repository version, the main thing I’m struggling with is that we (probably) expect a full churn on those derivative artifacts every rebuild, which basically sounds like a publication. But - we explicitly want to maintain extra context about that stuff, so putting it directly in a publication doesn’t help. It feels a bit awkward either way.

It kinda feels like if we want to be part of a general build system we ought to shift towards an architecture similar-ish to [0]

I wouldn’t call it a build system :slight_smile: I would say it is more like a generic artifact storage solution for CI/CD workflows. Building is a step in the process, but it is essentially an artifact management system.

Do you feel like this development is in scope for the Pulp project? It doesn’t necessarily have to be in the pulpcore component, it may be a new kind of pulp-manager component, but I wonder if we can work on this under the Pulp umbrella?

Correct

I wouldn’t call it a build system :slight_smile: I would say it is more like a generic artifact storage solution for CI/CD workflows. Building is a step in the process, but it is essentially an artifact management system.

Do you feel like this development is in scope for the Pulp project? It doesn’t necessarily have to be in the pulpcore component, it may be a new kind of pulp-manager component, but I wonder if we can work on this under the Pulp umbrella?

I think it’s in-scope, yes. I can see something like this potentially working. It would definitely be good to keep the initial investigation very experimental and find the pain points first.

I think it’s in-scope, yes. I can see something like this potentially working. It would definitely be good to keep the initial investigation very experimental and find the pain points first.

Hi! I work with @bookwar and wanted to provide some of my feedback as well.

With the current design of Repositories, Versions, Publications, Distributions etc., it wasn’t easy to understand how to properly use these concepts. I went through Workflows — Pulp RPM Support 3.21.0 documentation several times but still struggled with how to do certain tasks (e.g. removing RPMs from a Repository, properly “overlaying” or “merging” repositories, or which steps to take to efficiently make a new distribution).

More documentation would certainly help me, but at the same time I wonder if some of the high-level concepts should be built into the API. It should certainly be considered in the design of this new Compose plugin.