Pulp python repository sync takes ages

Hi folks,

Problem:
I have been trying to create a mirror of pypi.org for a while, but the repository sync task takes ages.
There are 3 subtasks in running mode:

  • Fetching Project Metadata: 23529 done
  • Associating Content: 314000 done
  • Downloading Artifacts: 0 done

I guess “Downloading Artifacts” will remain at 0 because I have set the remote policy to "on_demand".

Expected outcome:
complete metadata sync in less than 1 hour

Pulpcore version:
3.49.0

Pulp plugins installed and their versions:
Python plugin (pulp_python) 3.11.0

Operating system - distribution and version:
official docker image running on Ubuntu 22.04 LTS

Other relevant data:

  • Direct connection (corporate), no proxy.
  • Firewall is OK: I can pip install from the official PyPI repository.
  • The Pgsql volume is growing slowly, so I do think something is happening. Traffic is around 100k/s-500k/s.
  • This container also serves other Ubuntu repositories with the DEB pulp plugin, and those work great and sync faster.

What is the unit of the “done” value?

thank you everyone!

This is a known problem when syncing very large repositories like PyPI with Pulp. pulp_python creates a database entry for each release file of every package, currently around 10 million for all of PyPI. Creating that many entries is very slow, so the database is the main bottleneck for large syncs. If you don’t plan on using all of PyPI, I would recommend setting up pull-through caching, so that Pulp caches your most commonly used packages and you don’t have to sync all of PyPI. [0] Otherwise, you could use the --includes field on the remote to list the specific packages you want, for a smaller sync. [1]

[0] Publish and Host — Pulp python Support 3.11.0 documentation
[1] Synchronize a Repository — Pulp python Support 3.11.0 documentation
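
For the --includes route, the package list usually already exists as a requirements file. Here is a minimal Python sketch (not taken from the Pulp docs) that turns requirements-style lines into a JSON array. It assumes the includes field takes a JSON list of project specifiers; the function name includes_from_requirements and the sample packages are made up for illustration, and the comment stripping is naive (it would break a specifier containing a "#").

```python
import json


def includes_from_requirements(lines):
    """Return a JSON array of specifiers built from requirements-style lines.

    Blank lines, comments, and pip option lines (starting with '-') are
    skipped; duplicates are dropped while keeping the original order.
    """
    specs = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments (naive split)
        if not line or line.startswith("-"):
            continue
        if line not in specs:
            specs.append(line)
    return json.dumps(specs)


if __name__ == "__main__":
    # Hypothetical sample input, standing in for the lines of a requirements file.
    sample = ["pygments", "# docs tooling", "Django~=2.0  # pinned", "", "pygments"]
    print(includes_from_requirements(sample))
```

The printed string is meant to be passed as the value of --includes when creating or updating the remote; check `pulp python remote --help` for the exact option spelling in your pulp-cli version.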


Thank you Gerrod for your fast response.

Add remote to distribution to enable pull-through caching
pulp python distribution update --name foo --remote bar

I didn’t know this trickery, sounds a bit hackish, doesn’t it?
Anyway, it works, fetching the package and its deps:

pip install -i http://myserver/pulp/content/pypi/pypi pygments

though pip complains that the HTML index page is not a proper HTML 5 document (violating PEP 503). This might fail in the next pip release.
FYI, I used the remote url https://pypi.org

EDIT: actually, I had wrongly added an extra pypi path segment, based on the distribution’s output. With http://myserver/pulp/content/pypi/simple there is no more issue!
Many thanks! We can state that this thread is SOLVED.
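
For anyone who trips over the same URL mistake, here is a small Python sketch of how the working index URL is put together. It assumes the layout seen in this thread (content URL, then the distribution's base path, then "simple"); the helper name simple_index_url is hypothetical, and myserver and pypi are just the values from my setup.

```python
def simple_index_url(content_url, base_path):
    """Join a Pulp content URL and a distribution base path into a pip index URL.

    Assumes the index lives at <content_url>/<base_path>/simple, which is the
    form that worked in this thread.
    """
    root = content_url.rstrip("/")  # tolerate a trailing slash on the content URL
    path = base_path.strip("/")     # tolerate leading/trailing slashes on the base path
    return f"{root}/{path}/simple"


if __name__ == "__main__":
    url = simple_index_url("http://myserver/pulp/content", "pypi")
    print(f"pip install -i {url} pygments")
```

Run as a script, this prints the same pip command that ended up working for me.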


Outstanding - thanks for reporting back!