Pulp snapshot feature proposal

Pulp repo snapshot

Add snapshot support to Pulp

Introduction

Pulp currently doesn’t have the concept of a repo snapshot. Even though
it does have publications, which are essentially snapshots, Pulp doesn’t
serve these publications as snapshots. This proposal introduces
snapshot support in Pulp by marking specific publications as snapshots
and serving them.

Multiple repos have added snapshot support (e.g. Ubuntu and Debian
repos). This highlights its importance to customers and, in turn, repo
admins. Some of the quoted benefits of snapshots are:

  • Enabling reproducible deployments of a set of packages at a
    particular date and time.

  • Determining when a change in behavior occurred in the archive.

  • Supporting a structured update workflow, for example where a
    snapshot is validated in one environment before being released to
    other environments.

Approach

Use Pulp publications as snapshots. Whenever a new publication is
created, the creator has the option to mark it as a snapshot.

A new distro type, “snapshot distro”, will be introduced to serve
snapshots of a specific repo. It will serve both the snapshot listing
and the individual snapshots.

Requirements

P0: Create snapshots for each enabled repo when it’s published with
the snapshot flag.

P0: Snapshots are available using consistent distro paths based on the
publication creation timestamp.

P0: Snapshots are never deleted once created.

P1: Support accessing a snapshot based on an at-or-before timestamp.

P1: List the available snapshots for a specific repo.

P2: Document snapshot availability for customers.

P2: Serve the snapshots as directories by year and month (e.g.
<pulp_domain>/snapshot/<repo_name>/<year>/<month>/).

Proposed solution

Creating a snapshot

  • When a publication is created for a repo (i.e. repo publish), we
    can pass a snapshot flag to the publication create API.

  • The snapshot flag marks the resulting publication as a snapshot
    publication.

  • Repo versions associated with snapshot publications should be
    excluded from retain_repo_versions cleanup.

  • We don’t have to create a distro for this publication, as it will be
    served by the repo’s snapshot distro.
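The retain_repo_versions exclusion above could look roughly like the sketch below. This is a simplified model, not Pulp’s actual cleanup code: the function name and the snapshot_versions argument are hypothetical, and it assumes protected versions simply don’t get deleted, while the newest N versions are kept as they are today.

```python
def versions_to_delete(version_numbers, retain_repo_versions, snapshot_versions):
    """Return the repo version numbers that cleanup is allowed to remove.

    version_numbers: all version numbers of the repo.
    retain_repo_versions: how many of the newest versions to keep.
    snapshot_versions: versions referenced by snapshot publications
        (hypothetical input; the proposal does not define how this is looked up).
    """
    newest_first = sorted(version_numbers, reverse=True)
    # Everything older than the newest N versions is normally eligible for cleanup.
    expired = newest_first[retain_repo_versions:]
    protected = set(snapshot_versions)
    # Versions backing a snapshot publication are skipped.
    return sorted(v for v in expired if v not in protected)


print(versions_to_delete([1, 2, 3, 4, 5, 6], 2, [2, 4]))
```

With six versions, retain_repo_versions=2 and snapshots on versions 2 and 4, only versions 1 and 3 are removed.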

Serving snapshots

  • To enable the snapshot feature for a repo, we create a special
    “snapshot” distro.

  • The snapshot distro will only need the repo id and the snapshot
    path prefix.

  • The actual snapshot path for the repo will be
    snapshot_path_prefix/publication_creation_timestamp

  • When we hit the repo’s snapshot path prefix, Pulp will return all
    the timestamps for the snapshot publications for the repo associated
    with the distro
    curl packages.microsoft.com/snapshot/yumrepos/azure-cli

  • When we hit a snapshot URL (i.e.
    repo_snapshot_path_prefix/timestamp), regardless of whether we have
    the exact timestamp, Pulp will serve the most recent publication
    that was created at or before this timestamp.
    curl packages.microsoft.com/snapshot/yumrepos/azure-cli/20240720T120000Z/
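As a rough illustration of the path scheme above, the snapshot paths could be derived from publication creation times like this. The timestamp format is an assumption taken from the example URL (20240720T120000Z), and the function names are hypothetical, not part of Pulp.

```python
from datetime import datetime, timezone

# Assumed format, inferred from the example URL in this proposal.
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def snapshot_path(path_prefix, created):
    """Build snapshot_path_prefix/publication_creation_timestamp/ for one publication."""
    stamp = created.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{path_prefix.strip('/')}/{stamp}/"


def list_snapshot_paths(path_prefix, creation_times):
    """Return the paths a listing at the prefix could expose, oldest first."""
    return [snapshot_path(path_prefix, t) for t in sorted(creation_times)]


created = datetime(2024, 7, 20, 12, 0, 0, tzinfo=timezone.utc)
print(snapshot_path("snapshot/yumrepos/azure-cli", created))
```

Normalizing to UTC before formatting keeps the paths consistent regardless of the timezone the publication timestamp is stored in.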

Handling at-or-before snapshots

When customers access the Pulp snapshots, they should be able to
specify an arbitrary timestamp. The timestamp they specify doesn’t have
to be the exact timestamp of a snapshot/publication. We should return
the most recent snapshot created at or before the given timestamp.
Assume we have the following snapshots (notice the timestamps):

If we get a request for
https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T080000Z/
we should resolve it to
https://packages.microsoft.com/snapshot/ubuntu/24.04/prod/20240720T060000Z/
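A minimal sketch of the at-or-before lookup, assuming timestamps are fixed-width UTC strings in the format used in the URLs above (so plain string comparison matches chronological order). The list of available snapshots here is illustrative only, chosen to be consistent with the request and result above; the function name is hypothetical.

```python
from bisect import bisect_right


def resolve_snapshot(requested, available):
    """Return the newest timestamp in `available` that is <= `requested`.

    Returns None when the request predates every snapshot.
    """
    ordered = sorted(available)
    # bisect_right gives the number of snapshots created at or before `requested`.
    index = bisect_right(ordered, requested)
    return ordered[index - 1] if index else None


# Illustrative snapshot timestamps (assumed, not taken from a real repo).
snapshots = ["20240719T180000Z", "20240720T060000Z", "20240720T120000Z"]
print(resolve_snapshot("20240720T080000Z", snapshots))
```

An exact match resolves to itself, and a request older than the first snapshot has nothing to resolve to, which the caller would presumably turn into a 404.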

Here are some examples from the Debian snapshot service:

We can either redirect to the exact snapshot timestamp or just serve the
appropriate snapshot. Each approach has its own pros and cons, so we
can make this configurable by admins.

  • Redirecting
    • Pros: CDNs will cache the same snapshot content only once.
    • Cons: Almost every snapshot request will hit origin. Origin will redirect to the appropriate snapshot, which will be cached and served by CDN afterwards.
  • URL rewrite
    • Pros: Almost all snapshot requests will be handled by the CDN after the first request. A request for a previously requested snapshot (potentially from the same client) will be handled by the CDN without hitting origin.
    • Cons: CDNs will cache the same snapshot content multiple times based on the customer-specified timestamp for the snapshot.
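To make the difference between the two approaches concrete, here is a small sketch of how a handler might respond once the at-or-before timestamp has been resolved. The mode values, the function name and the choice of a 302 status are all assumptions for illustration; the proposal only says the behavior could be configurable.

```python
def snapshot_response(path_prefix, resolved, mode):
    """Return (status, headers, path_to_serve) for an already resolved snapshot.

    mode: "redirect" or "rewrite" (hypothetical admin setting).
    """
    target = f"/{path_prefix.strip('/')}/{resolved}/"
    if mode == "redirect":
        # The client (and CDN) is sent to the canonical snapshot URL,
        # so content is cached only under that one URL.
        return 302, {"Location": target}, None
    if mode == "rewrite":
        # Content is served directly under the requested URL,
        # so the CDN caches it once per requested timestamp.
        return 200, {}, target
    raise ValueError(f"unknown mode: {mode}")


print(snapshot_response("snapshot/ubuntu/24.04/prod", "20240720T060000Z", "redirect"))
```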

I do something like this with RPM repos

Our main server syncs every day from the upstream Red Hat servers, and every day I update the repo, create a publication and update a distribution (called Latest)

Every week I can create a publication + distribution as a snapshot for that week named -ddmmyyy

All my other Pulp servers sync off those distributions

Works fine, no real gaps


Thanks for the response. We have considered such a workflow and probably will implement a solution along those lines if adding this feature to Pulp is not an option. But there are some things that make implementing this outside of Pulp more challenging:

  • The workflows are more complicated. Some examples:
    • Instead of just calling publish, we’d have to publish, wait for and monitor the publish task, and then create the distribution.
    • If we want to delete a snapshot publication, we’d have to remember to delete both the publication and the distro.
    • If we want to change the base path for a set of snapshots, we’d have to update a large number of distros.
  • We have hundreds of repos and we’d end up with thousands of snapshot distros to manage.
  • We’d like to protect snapshot publications from repo version cleanup.
  • We’d also like the ability to redirect/rewrite requests to fuzzy match timestamps like Ubuntu and Debian do.

It does seem like snapshots are a common use case in the Pulp community (I’ve come across other examples of code and discussions), so it might be useful to have this in Pulp, but I’m interested in alternatives and suggestions. For one thing, perhaps there are ways to add features to Pulp that fill the gaps I’ve mentioned without implementing this entire snapshot feature in Pulp.

I have created this draft PR to capture my PoC. Feedback is welcome.