Worker timeout and 'retain_repo_versions'

wiad · November 9, 2023, 8:34am

Problem:
A while back I set retain_repo_versions for all our repos. Since then we frequently run into WORKER TIMEOUT when we run our daily job which copies content between repos. I suspect that this is related since we did not have any such issues before.

I’ve seen and applied the patch

github.com/pulp/pulpcore issue: Need a way to force async execution of a DELETE task
Opened 10:40AM - 08 Sep 23 UTC by sskracic, closed 02:48PM - 08 Sep 23 UTC. Labels: Issue, Triage-Needed.

**Version**
pulpcore-3.28.10 pulp-rpm-3.22.3

**Describe the bug**
Deleting… a handful of repositories with large RPM content now takes hours instead of minutes.

**To Reproduce**
1. sync RHEL 8 baseos and appstream repos from CDN
2. try to delete them
3. since all delete calls are now synchronous, the deletion is performed in a singlethreaded fashion, instead of being distributed across available pulp workers (8 or 16, depending on the installation). It now takes hours instead of couple of minutes. In addition, during this time the API gunicorn worker is completely blocked and is unavailable to process other requests.

**Expected behavior**
I would like an API addition (eg. a flag ?async=true) that would force my DELETE /repositories/rpm/rpm/ call to be asynchronous and distributed to either an available worker or, failing that, put in the pending queue.

**Additional context**
Performance of a repository deletion invocation operation (not the actual time it takes to really delete the repo) is very important for RHUI ops. Sometimes a repo needs to be deleted and re-created to get rid of the repository version history and its associated artifacts. More importantly, this performance is vital for the RHUI automation and CI during development.

which seemed relevant, but the error still appears.
Any ideas?

Expected outcome:
No timeouts

Pulpcore version:
python3-pulpcore-3.28.10-7.el9.noarch (with patch applied)

Pulp plugins installed and their versions:
python3-pulp-rpm-3.22.3-1.el9.noarch

Operating system - distribution and version:
RHEL 9.2

Other relevant data:
worker log:

Nov 09 05:26:46 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulpcore.tasking.tasks:INFO: Starting task 018bb28c-135d-7785-bf7c-01f5658bb9ee
Nov 09 05:27:20 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulp_rpm.app.depsolving:INFO: Writing solver debug data to /var/tmp/pulp/018bb28c-135d-7785-bf7c-01f5658bb9ee
Nov 09 05:27:25 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulpcore.app.models.repository:INFO: Deleting repository version <Repository: rhel8-epel; Version: 488> due to version retention limit.
Nov 09 05:27:33 pulpcore-worker[3107547]: pulp [aa148ca63ff641cbb90db6b9220af736]: pulpcore.tasking.tasks:INFO: Task completed 018bb28c-135d-7785-bf7c-01f5658bb9ee

api log:

Nov 09 05:27:15 gunicorn[2899920]: 127.0.0.1 - admin [09/Nov/2023:05:27:15 +0000] "GET /pulp/api/v3/content/rpm/packages/?repository_version=%2Fpulp%2Fapi%2Fv3%2Frepositories%2Frpm%2Frpm%2F10e51ae6-65c7-42aa-8ab1-ffebdf752500%2Fversions%2F841%2F&limit>
Nov 09 05:27:16 gunicorn[2899917]: 127.0.0.1 - admin [09/Nov/2023:05:27:16 +0000] "GET /pulp/api/v3/repositories/rpm/rpm/?name=rhel8-epel HTTP/1.1" 200 694 "-" "HTTPie/3.2.1"
Nov 09 05:28:47 gunicorn[2899914]: [2023-11-09 05:28:47 +0000] [2899914] [CRITICAL] WORKER TIMEOUT (pid:2899918)
Nov 09 05:28:48 gunicorn[2899914]: [2023-11-09 05:28:48 +0000] [2899914] [WARNING] Worker with pid 2899918 was terminated due to signal 9

Pulp starts a task to delete a repository version for the rhel8-epel repo, and then my script starts a content search in this repo which causes a worker timeout.

Our WORKER_TTL is set to 90s.
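
As a rough sanity check on the timing, the gap between the last completed API request and the WORKER TIMEOUT line can be computed from the two timestamps in the api log. This is only a sketch: it assumes the 90s value is what gunicorn applies as its worker timeout, and that the stuck request started shortly after the last logged one (the stuck request itself never shows up in the access log).

```python
from datetime import datetime

# Timestamp of the last completed request, copied from the api log (access-log format).
request_logged = datetime.strptime(
    "09/Nov/2023:05:27:16 +0000", "%d/%b/%Y:%H:%M:%S %z"
)
# Timestamp of the WORKER TIMEOUT line, copied from the api log (error-log format).
timeout_logged = datetime.strptime(
    "2023-11-09 05:28:47 +0000", "%Y-%m-%d %H:%M:%S %z"
)

elapsed = (timeout_logged - request_logged).total_seconds()
print(elapsed)  # 91.0
```

The roughly 91 second gap is consistent with a request hanging for the full 90s before the gunicorn worker was killed.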

x9c4 · November 9, 2023, 10:01am

Hey, I think this really is not the same issue. But I can imagine that deleting the old repository version (as per retain_repo_versions) is possibly holding a lot of Postgres row locks on the RepositoryContent table in a long-running transaction, blocking all concurrent read access.
Maybe we need to find a way to take the repo_version offline without deleting it right away. Then update the repository content record in chunks and then finally delete it.
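
A minimal sketch of that idea, to make the three steps concrete. It is not Pulp code: it uses an in-memory SQLite database, and the `version` and `repo_content` tables, the `offline` flag, and the `retire_version` helper are all hypothetical names. The chunked step here deletes rows; in Pulp it would presumably be updates to the RepositoryContent records instead.

```python
import sqlite3


def retire_version(conn, version_id, chunk_size=1000):
    """Retire one version using several short transactions instead of one long one."""
    # Step 1: mark the version offline so readers can skip it (hypothetical flag).
    with conn:
        conn.execute("UPDATE version SET offline = 1 WHERE id = ?", (version_id,))

    # Step 2: remove its content rows chunk by chunk; each chunk commits on its own.
    deleted = 0
    while True:
        with conn:
            cur = conn.execute(
                "DELETE FROM repo_content WHERE id IN ("
                "SELECT id FROM repo_content WHERE version_id = ? LIMIT ?)",
                (version_id, chunk_size),
            )
        if cur.rowcount == 0:
            break
        deleted += cur.rowcount

    # Step 3: finally delete the now-empty version row.
    with conn:
        conn.execute("DELETE FROM version WHERE id = ?", (version_id,))
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE version (id INTEGER PRIMARY KEY, offline INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE repo_content (id INTEGER PRIMARY KEY, version_id INTEGER)")
conn.executemany("INSERT INTO version (id) VALUES (?)", [(488,), (489,)])
conn.executemany(
    "INSERT INTO repo_content (version_id) VALUES (?)",
    [(488,)] * 2500 + [(489,)] * 10,
)
conn.commit()

deleted = retire_version(conn, 488)
remaining = conn.execute("SELECT COUNT(*) FROM repo_content").fetchone()[0]
print(deleted, remaining)  # 2500 10
```

Because every chunk commits separately, locks would only be held for the duration of one small batch rather than for the whole deletion.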

In any case, it’d be worth writing an issue. Bonus points for describing a minimal reproducer.

ggainey · November 9, 2023, 1:49pm

We’ve also had a report where deleting a bunch of repo-versions/content at once results in the worker getting a visit from the OOMKiller. Do your logs show any traces of such a thing?

In general, it definitely looks like we need to look more closely at optimizing the delete-path for time and memory.
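
For anyone wanting to check their own logs, here is a small sketch that scans log lines for typical kernel OOM-killer phrases. The patterns are assumptions about common kernel wording, which can differ between kernel versions, and `find_oom_lines` is a hypothetical helper. The sample input is the two gunicorn lines from the api log above.

```python
import re

# Assumed phrases from kernel OOM messages; adjust for your kernel's wording.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE
)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line for line in lines if OOM_PATTERN.search(line)]


sample = [
    "Nov 09 05:28:47 gunicorn[2899914]: [2023-11-09 05:28:47 +0000] [2899914] "
    "[CRITICAL] WORKER TIMEOUT (pid:2899918)",
    "Nov 09 05:28:48 gunicorn[2899914]: [2023-11-09 05:28:48 +0000] [2899914] "
    "[WARNING] Worker with pid 2899918 was terminated due to signal 9",
]

print(find_oom_lines(sample))  # []
```

Note that the "terminated due to signal 9" line comes from gunicorn killing its own timed-out worker, so it does not match and is not by itself evidence of an OOM kill.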

wiad · November 9, 2023, 2:08pm

> We’ve also had a report where deleting a bunch of repo-versions/content at once results in the worker getting a visit from the OOMKiller. Do your logs show any traces of such a thing?

No, no sign of any OOM-kills in my case.