Worker timeout and 'retain_repo_versions'

Problem:
A while back I set retain_repo_versions for all our repos. Since then, we frequently run into WORKER TIMEOUT errors during our daily job, which copies content between repos. I suspect the two are related, since we did not have any such issues before.

I’ve seen and applied the patch

which seemed relevant, but the error still appears.
Any ideas?

Expected outcome:
No timeouts

Pulpcore version:
python3-pulpcore-3.28.10-7.el9.noarch (with patch applied)

Pulp plugins installed and their versions:
python3-pulp-rpm-3.22.3-1.el9.noarch

Operating system - distribution and version:
RHEL 9.2

Other relevant data:
worker log:

Nov 09 05:26:46 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulpcore.tasking.tasks:INFO: Starting task 018bb28c-135d-7785-bf7c-01f5658bb9ee
Nov 09 05:27:20 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulp_rpm.app.depsolving:INFO: Writing solver debug data to /var/tmp/pulp/018bb28c-135d-7785-bf7c-01f5658bb9ee
Nov 09 05:27:25 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulpcore.app.models.repository:INFO: Deleting repository version <Repository: rhel8-epel; Version: 488> due to version retention limit.
Nov 09 05:27:33 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulpcore.tasking.tasks:INFO: Task completed 018bb28c-135d-7785-bf7c-01f5658bb9ee

api log:

Nov 09 05:27:15 gunicorn[2899920]: 127.0.0.1 - admin [09/Nov/2023:05:27:15 +0000] "GET /pulp/api/v3/content/rpm/packages/?repository_version=%2Fpulp%2Fapi%2Fv3%2Frepositories%2Frpm%2Frpm%2F10e51ae6-65c7-42aa-8ab1-ffebdf752500%2Fversions%2F841%2F&limit>
Nov 09 05:27:16 gunicorn[2899917]: 127.0.0.1 - admin [09/Nov/2023:05:27:16 +0000] "GET /pulp/api/v3/repositories/rpm/rpm/?name=rhel8-epel HTTP/1.1" 200 694 "-" "HTTPie/3.2.1"
Nov 09 05:28:47 gunicorn[2899914]: [2023-11-09 05:28:47 +0000] [2899914] [CRITICAL] WORKER TIMEOUT (pid:2899918)
Nov 09 05:28:48 gunicorn[2899914]: [2023-11-09 05:28:48 +0000] [2899914] [WARNING] Worker with pid 2899918 was terminated due to signal 9

Pulp starts a task to delete a repository version for the rhel8-epel repo, and then my script starts a content search in the same repo, which ends in a worker timeout.

Our WORKER_TTL is set to 90s.
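
As a rough cross-check of the timing, here is a small sketch that works backwards from the WORKER TIMEOUT line. It assumes the 90s value is also the timeout gunicorn applied to the API worker, which the logs above do not confirm; the timestamps are copied from the log excerpts.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the worker and api logs above.
task_started = datetime.strptime("2023-11-09 05:26:46", FMT)
task_completed = datetime.strptime("2023-11-09 05:27:33", FMT)
timeout_logged = datetime.strptime("2023-11-09 05:28:47", FMT)

# Assumption: the API worker was killed 90 seconds after it picked up the request.
request_started = timeout_logged - timedelta(seconds=90)

# Did the stuck request begin while the copy/delete task was still running?
overlaps = task_started <= request_started <= task_completed
print(request_started.time(), overlaps)
```

Under that assumption, the stuck request would have started while the task that deleted version 488 was still running.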

Hey, I think this really is not the same issue. But I can imagine that deleting the old repository version (as per retain_repo_versions) holds a lot of Postgres row locks on the RepositoryContent table in a long-running transaction, blocking concurrent access to those rows.
Maybe we need to find a way to take the repo_version offline without deleting it right away, then update the RepositoryContent records in chunks, and only then delete the version.
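
To illustrate the chunking idea only, here is a minimal sketch that removes rows in small batches, each in its own short transaction, instead of one long one. This is not how pulpcore does it today: the `repo_content` table, its columns, and `delete_in_chunks` are hypothetical, it deletes rows rather than updating them, and an in-memory SQLite database stands in for Postgres.

```python
import sqlite3


def delete_in_chunks(conn, version_id, chunk_size=1000):
    """Delete rows of one version in small batches; return the number removed."""
    total = 0
    while True:
        # "with conn" commits on exit, so every batch is its own short transaction.
        with conn:
            cur = conn.execute(
                "DELETE FROM repo_content WHERE id IN ("
                "SELECT id FROM repo_content WHERE version_id = ? LIMIT ?)",
                (version_id, chunk_size),
            )
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Hypothetical stand-in table: 2500 rows for version 488, 10 rows for version 489.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo_content (id INTEGER PRIMARY KEY, version_id INTEGER)")
with conn:
    conn.executemany(
        "INSERT INTO repo_content (version_id) VALUES (?)",
        [(488,)] * 2500 + [(489,)] * 10,
    )

removed = delete_in_chunks(conn, 488, chunk_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM repo_content").fetchone()[0]
print(removed, remaining)
```

Because each batch commits separately, other sessions would only ever wait for one batch rather than the whole deletion. The cost is that the version is half-deleted between batches, which is why it would have to be taken offline first.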

In any case, it’d be worth writing an issue. Bonus points for describing a minimal reproducer.

We’ve also had a report where deleting a bunch of repo-versions/content at once results in the worker getting a visit from the OOMKiller. Do your logs show any traces of such a thing?

In general, it definitely looks like we need to look more closely at optimizing the delete-path for time and memory.


No, no sign of any OOM-kills in my case.
