Destroying RPM publications is slow

zapp42 · March 1, 2022, 4:53pm

Hi,
I have ~23000 publications due to an unfortunate combination of hooking up our Jenkins instance(s) to Pulp with an auto-publish rpm repository.

Now destroying those publications is quite slow (using the “pulp rpm publication destroy” command).

My question: Why is this so slow? What is happening on the server? What can I do to get the same result faster (maybe some direct DB modification?).

Thanks,
Rolf

ggainey · March 4, 2022, 3:16pm

Hey Rolf - ok, let’s see.

A pulp_rpm Publication includes a lot of entities - basically, all the files you find in /repodata. Deleting the publication cascades to deleting each of those as well. For 23K publications, this means A LOT of database work is going on. On top of that, every call pays the startup cost and network cost of pulp-cli. Even if each delete itself takes, say, only 500msec, we’re still looking at more than 3 hours to get through 23K deletes. I expect that adding in the cli-startup and network time makes it more like ~1s per call, so 6+ hours.
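For reference, the back-of-the-envelope arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the per-call timings are the guesses from this post, not measurements, and `estimated_hours` is a made-up helper name.

```python
def estimated_hours(n_calls: int, seconds_per_call: float) -> float:
    """Total wall-clock time in hours for n_calls sequential calls."""
    return n_calls * seconds_per_call / 3600


# Per-call timings are the rough guesses from the post, not measured values.
for per_call in (0.5, 1.0):
    hours = estimated_hours(23_000, per_call)
    print(f"{per_call:.1f}s per call -> {hours:.1f} hours")
```

Plugging in a measured per-call time from a handful of real deletes would give a better estimate for a specific instance.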

You could set up your script to parallelize the calls. However, ‘delete’ calls don’t generate a background task; you’d just run out of httpd workers, and probably overwhelm postgres, while that was going on.

My thought is, you’re better off letting this happen as it is, and not trying to convince Pulp to do more deletes at once. This will let the instance continue doing the work it should be doing, while the deletes happen. Also, this should be a one-time cleanup - once it’s done, you’ll never need to delete that many publications at once.

If you really want to get the job done faster, you could remove pulp-cli from the timing by going to the REST API directly (e.g., http DELETE publication-href-here). You could even script that up to do the deletes in parallel (but see my earlier point about overwhelming your instance).
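A minimal sketch of what a direct DELETE could look like from Python, using only the standard library. Assumptions that are not from this thread: the instance accepts HTTP basic auth, and the base URL, credentials, and helper names (`build_delete_request`, `delete_publication`) are placeholders for illustration. The href is whatever your instance reports for the publication.

```python
import base64
import urllib.request


def build_delete_request(base_url: str, href: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP DELETE request for one publication href."""
    # Assumption: the server accepts HTTP basic auth.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + href,
        method="DELETE",
        headers={"Authorization": f"Basic {token}"},
    )


def delete_publication(base_url: str, href: str, username: str, password: str) -> int:
    """Send the DELETE and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_delete_request(base_url, href, username, password)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status
```

Calling `delete_publication("https://pulp.example.com", "publication-href-here", "admin", "password")` in a plain loop avoids the per-call startup cost of a CLI process, while still sending only one delete at a time.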

wibbit · March 7, 2022, 10:30am

Can I also advocate the use of the Python API for work like this? It would be relatively trivial to put together a script that threads out to a set number of workers (say 10) that go off and quietly churn through the deletes for you.
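A rough sketch of that worker-pool idea, using `concurrent.futures` from the standard library. To avoid guessing at the client library's exact calls, the function that deletes a single publication is passed in as `delete_one`; that parameter and the `delete_all` name are illustrative, not part of any Pulp API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple


def delete_all(
    hrefs: Iterable[str],
    delete_one: Callable[[str], object],
    max_workers: int = 10,
) -> List[Tuple[str, Exception]]:
    """Run delete_one(href) for every href on a bounded thread pool.

    Returns a list of (href, exception) pairs for the deletes that failed,
    so one bad href does not abort the whole cleanup.
    """
    failures: List[Tuple[str, Exception]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its href so failures can be reported.
        futures = {pool.submit(delete_one, href): href for href in hrefs}
        for future in as_completed(futures):
            href = futures[future]
            try:
                future.result()
            except Exception as exc:
                failures.append((href, exc))
    return failures
```

Here `delete_one` can be any callable that removes one publication, for example a thin wrapper around the Python client or around a REST DELETE. Keeping `max_workers` small is deliberate, given the earlier warning about running out of httpd workers and overwhelming postgres.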

Honestly, having the Python client libraries available to you (if you’re proficient in Python) can be a real boon for interrogating the Pulp infrastructure in a relatively safe manner.