Pulp repo snapshot
Add snapshot support to Pulp
Introduction
Pulp currently doesn’t have the concept of a repo snapshot. Even though
it does have publications which are essentially snapshots, Pulp doesn’t
serve these publications as snapshots. This proposal adds snapshot
support to Pulp by marking specific publications as snapshots and
serving them.
Multiple repos have added snapshot support (e.g. the Ubuntu and Debian
repos). This highlights its importance to customers and, in turn, to
repo admins. Some of the quoted benefits of snapshots are:
- Enabling reproducible deployments of a set of packages at a
  particular date and time.
- Determining when a change in behavior occurred in the archive.
- Supporting a structured update workflow, for example where a
  snapshot is validated in one environment before being released to
  other environments.
Approach
Use Pulp publications as snapshots. Whenever a new publication is
created, the creator has the option to mark it as a snapshot.
A new distro type “snapshot distro” will be introduced to serve
snapshots of a specific repo. This will handle serving both the
snapshots listing as well as the specific snapshots.
Requirements
P0: Create snapshots for each enabled repo when it’s published with
the snapshot flag.
P0: Snapshots are available using consistent distro paths based on the
publication creation timestamp.
P0: Snapshots are never deleted once created.
P1: Support accessing a snapshot based on an at-or-before timestamp.
P1: List the available snapshots for a specific repo.
P2: Document snapshot availability for customers.
P2: Serve the snapshots as directories by year and month (e.g.
<pulp_domain>/snapshot/<repo_name>/<year>/<month>/).
Proposed solution
Creating a snapshot
- When a publication create is invoked for a repo (i.e. repo publish),
  we can pass a snapshot flag to the publication create API.
- The snapshot flag will mark the publication that will be created as
  a snapshot publication.
- Repo versions associated with snapshot publications should be
  excluded from retain_repo_versions cleanup.
- We don’t have to create a distro for this publication as it will be
  served by the repo’s snapshot distro.
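
The retention rule in the list above can be sketched as a small selection function. This is only an illustration under stated assumptions: `versions_to_delete` and its parameters are hypothetical names, not part of Pulp's API, and it assumes that snapshot-backed versions are kept in addition to (not counted against) the retain_repo_versions limit.

```python
def versions_to_delete(version_numbers, snapshot_versions, retain):
    """Return the repo version numbers that cleanup may remove.

    version_numbers: all existing repo version numbers for one repo.
    snapshot_versions: version numbers backing a snapshot publication
        (hypothetical input; how Pulp would track this is not specified).
    retain: the repo's retain_repo_versions value.
    """
    ordered = sorted(version_numbers)
    # Guard against retain == 0, because ordered[-0:] would be the whole list.
    keep_recent = set(ordered[-retain:]) if retain > 0 else set()
    protected = set(snapshot_versions)
    # A version is removable only if it is neither recent nor snapshot-backed.
    return [v for v in ordered if v not in keep_recent and v not in protected]
```

For example, with versions 1 to 7, snapshots on versions 2 and 5, and a retention of 3, only versions 1, 3, and 4 would be eligible for cleanup.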
Serving snapshots
- To enable the snapshot feature for a repo, we create a special
  “snapshot” distro.
- The snapshot distro will only need the repo id and the snapshot path
  prefix.
- The actual snapshot path for the repo will be
  snapshot_path_prefix/publication_creation_timestamp.
- When we hit the repo’s snapshot path prefix, Pulp will return all
  the timestamps of the snapshot publications for the repo associated
  with the distro:
  curl packages.microsoft.com/snapshot/yumrepos/azure-cli
- When we hit a snapshot URL (i.e.
  repo_snapshot_path_prefix/timestamp), regardless of whether a
  snapshot exists with that exact timestamp, Pulp will serve the most
  recent snapshot publication that was created at or before this
  timestamp:
  curl packages.microsoft.com/snapshot/yumrepos/azure-cli/20240720T120000Z/
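
As a rough sketch of the path scheme described above, the helpers below build a snapshot path from a publication creation time and parse the timestamp segment back. The compact UTC format is inferred from the example URL (20240720T120000Z); the function names and the `TIMESTAMP_FORMAT` constant are hypothetical and not part of Pulp.

```python
from datetime import datetime, timezone

# Assumed format, inferred from the example URL in this proposal.
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def snapshot_path(prefix, created):
    """Build '<prefix>/<timestamp>/' from a timezone-aware creation time."""
    if created.tzinfo is None:
        raise ValueError("created must be timezone-aware")
    stamp = created.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{prefix.rstrip('/')}/{stamp}/"


def parse_snapshot_timestamp(segment):
    """Parse a URL timestamp segment into an aware UTC datetime."""
    parsed = datetime.strptime(segment, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)
```

Normalizing to UTC before formatting keeps the paths consistent regardless of the timezone the publication timestamp was recorded in.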
Handling at-or-before snapshots
When customers try to access the Pulp snapshots, they should be able to
specify an arbitrary timestamp. The timestamp they specify doesn’t have
to be the exact timestamp for a snapshot/publication. We should return
the snapshot before or at the given timestamp. Assume we have the
following snapshots (notice the timestamps):
- https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T000000Z/
- https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T060000Z/
- https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T120000Z/
If we get a request for
https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T080000Z/
we should resolve it to
https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T060000Z/
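
A minimal sketch of the at-or-before lookup, using the three example snapshots above. It assumes the timestamps are fixed-width UTC strings, so that plain string comparison matches chronological order, and that the list is kept sorted in ascending order. `resolve_snapshot` is a hypothetical helper, not an existing Pulp function; a real implementation would more likely query the database for the latest snapshot publication at or before the requested time.

```python
from bisect import bisect_right


def resolve_snapshot(snapshot_stamps, requested):
    """Return the latest snapshot timestamp at or before `requested`.

    snapshot_stamps must be sorted ascending. Returns None when the
    request predates the first snapshot.
    """
    # bisect_right gives the number of snapshots with stamp <= requested.
    index = bisect_right(snapshot_stamps, requested)
    return snapshot_stamps[index - 1] if index else None


stamps = ["20240720T000000Z", "20240720T060000Z", "20240720T120000Z"]
print(resolve_snapshot(stamps, "20240720T080000Z"))  # 20240720T060000Z
```

A request that predates the first snapshot has no valid answer; the sketch returns None, which a server would presumably translate into a 404.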
Here are some examples from the Debian snapshot service:
- Actual snapshot:
  https://snapshot.debian.org/archive/debian/20241201T025825Z/
- Arbitrary timestamp:
  https://snapshot.debian.org/archive/debian/20241201T025830Z/
Notice that the second one redirects to the actual snapshot.
We can either redirect to the exact snapshot timestamp or just serve the
appropriate snapshot under the requested URL. Each approach has its pros
and cons, so we can make this configurable by the admins.
Approach | Pros | Cons |
---|---|---|
Redirecting | CDNs will cache the same snapshot content only once. | Almost every snapshot request will hit origin. Origin will redirect to the appropriate snapshot, which will be cached and served by CDN afterwards. |
URL rewrite | Almost all snapshot requests will be handled by the CDN after the first request. A request for a previously requested snapshot timestamp (potentially from the same client) will be handled by the CDN without hitting origin. | CDNs will cache the same snapshot content multiple times based on the customer-specified timestamp for the snapshot. |
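
The difference between the two rows of the table can be illustrated with a small decision function. This is a sketch only: `plan_response`, the mode names, and the use of a 302 status are assumptions made for illustration, since the proposal does not specify a status code or a configuration setting name.

```python
def plan_response(mode, prefix, requested, resolved):
    """Decide how to answer a snapshot request.

    mode: "redirect" or "rewrite" (hypothetical admin setting).
    requested: timestamp from the URL; resolved: at-or-before snapshot.
    Returns (status, path). For 302 the path is the Location target;
    for 200 it is the snapshot whose content is served in place.
    """
    target = f"{prefix.rstrip('/')}/{resolved}/"
    if mode == "redirect" and requested != resolved:
        # The CDN ends up caching content only under the canonical URL.
        return 302, target
    # Rewrite mode (or an exact match): serve content under the requested URL.
    return 200, target
```

In rewrite mode the client keeps the URL it asked for, which is why the CDN ends up caching the same content once per distinct requested timestamp.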