Unable to sync full PyPI repository

Problem: I have a repository that I am attempting to sync against a remote for PyPI. The remote is configured with --includes left blank (i.e. to sync the entire PyPI repo, aside from some packages listed in --excludes), and it uses on_demand downloading. However, when I sync the repository against this remote, the sync never completes. I left one attempt running for over two weeks and it still had not finished. I am confused as to why this takes so long, since (if I understand correctly) it should only be syncing the metadata for the PyPI repo. Is it normal for this to take this long?

Edit: I have also noticed that the sync consistently maxes out a single CPU core. I believe that might be why the sync slows down and never completes, but I am confused as to why it uses so much CPU (and only on a single core).

Expected outcome: A sync of the PyPI repository metadata should complete in a reasonable amount of time (less than a week) and not max out CPU.

Pulpcore version: 3.20.0

Pulp plugins installed and their versions: pulp_python 3.7.2

Operating system - distribution and version: Ubuntu 22.04

Other relevant data:

Can you inspect the currently running sync task and report the numbers completed for “Fetching Project Metadata” and “Associating Content”? After a full sync of all of PyPI these two numbers should be around 400,000 and 6.6 million, respectively.
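For reference, those counts live in the task's `progress_reports` field, which you can read from the tasks API (`GET /pulp/api/v3/tasks/<task_uuid>/`). A minimal sketch of pulling them out of the task payload; the sample payload and its numbers below are invented for illustration, not from the thread:

```python
# Sketch: extract the "done" counts from a sync task's progress reports.
# In a real setup you would fetch the task JSON (e.g. with requests) and
# pass the resulting dict to this helper.

def progress_done(task: dict) -> dict:
    """Map each progress report's message to its 'done' count."""
    return {r["message"]: r["done"] for r in task.get("progress_reports", [])}

# Hypothetical task payload, trimmed to the relevant fields:
sample_task = {
    "state": "running",
    "progress_reports": [
        {"message": "Fetching Project Metadata", "state": "running", "done": 287000},
        {"message": "Associating Content", "state": "running", "done": 1200000},
    ],
}

print(progress_done(sample_task))
```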

If you don’t need all of these releases, you can significantly speed up a sync by excluding prereleases and capping the number of packages synced per project, via the remote fields prereleases and keep_latest_packages. See the docs for all of the sync filter settings.
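For example, a sketch of the body you might PATCH to the remote's href to apply those filters. `prereleases` and `keep_latest_packages` are the remote fields named above; the href path, value choices, and auth shown in the comments are placeholders:

```python
# Sketch: body for updating a Python remote to filter a sync.
# You would send this with e.g. requests.patch() to the remote's href
# (something like /pulp/api/v3/remotes/python/python/<uuid>/).

update_body = {
    "prereleases": False,        # skip alpha/beta/rc releases
    "keep_latest_packages": 3,   # keep only the 3 newest versions per project
}

# e.g.:
# import requests
# requests.patch(remote_href, json=update_body, auth=("admin", "password"))
print(update_body)
```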

The biggest bottleneck in syncing is saving to the database, especially for large syncs like all of PyPI. Currently pulp_python stores each released file as a Content unit + Artifact, so syncing all of PyPI would create 6.6 million Content units and Artifacts. Downloading all of the metadata can take around 2-10 hours depending on your network speed, but sadly saving the Content units is not done in bulk, so it can take significantly longer depending on your database speeds.

I don’t have that particular task that I left running for weeks handy at the moment, but I think it only made it to the high 200,000s for “Fetching Project Metadata” after all that time, maybe the low 300,000s. I’m not sure where it got to for “Associating Content”.

I was thinking that setting keep_latest_packages to a low value might help - I will give that a shot and try excluding the prereleases as well. Thanks for mentioning those.

Hmm. So even when using on-demand downloading, Pulp needs to pull the metadata and create all of the Content units? And the Artifacts are still not pulled until requested by a client?

I should clarify: the sync pipeline with on-demand downloading only pulls the metadata, and from that metadata it creates the Content units. For pulp_python, it pulls each project’s metadata, which contains information about each version of the project and all of the release files for those versions. For each release file it creates a Content unit and an Artifact, with the Artifact only being downloaded once requested by a client. Getting the metadata is relatively quick; it’s saving the Content units parsed from that metadata that is currently slow.
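To illustrate the shape of that work, here is a toy sketch of the metadata-to-content mapping. The structure loosely mirrors PyPI's per-project JSON (a releases map of version to file list), but the project name and files are invented:

```python
# Toy illustration: one Content unit + Artifact per release file found in a
# project's metadata. The project and filenames below are made up.

project_metadata = {
    "info": {"name": "example-pkg"},
    "releases": {
        "1.0": [
            {"filename": "example_pkg-1.0-py3-none-any.whl"},
            {"filename": "example_pkg-1.0.tar.gz"},
        ],
        "1.1": [
            {"filename": "example_pkg-1.1-py3-none-any.whl"},
        ],
    },
}

# Each file becomes a content unit to be saved to the database; with the
# on_demand policy the artifact bytes stay remote until a client asks.
content_units = [
    {
        "name": project_metadata["info"]["name"],
        "version": version,
        "filename": f["filename"],
    }
    for version, files in project_metadata["releases"].items()
    for f in files
]

print(len(content_units))  # 3 release files -> 3 Content units + Artifacts
```

Scale that up to ~400,000 projects and you arrive at the ~6.6 million units mentioned above, each saved individually rather than in bulk.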

Also, if you don’t need to curate packages from PyPI but just need a local cache I would suggest checking out the pull-through-cache feature. It was designed just for this when syncing takes too long and you don’t plan on individually filtering packages.

Got it. Thank you for clarifying all of that; it is super helpful to know how it all works. I was able to get a sync of the full PyPI repo to complete much more quickly with prereleases excluded and only the latest version of each package included.

We also explored the pull-through cache feature, but we would most likely want to filter out some specific packages. Thank you for the suggestion and again for all of the help!
