After having played a bit with monitoring and analyzing the content app (see Questions about scaling the content app),
I decided to come back to the API, as it is clearly the biggest offender in terms of memory today.
I have a very basic Katello test setup (pulpcore 3.22.2, pulp-rpm 3.19.2) on a machine with 32GB RAM and 4vCPU.
(No, that’s not the same as the above setup, but here I can go crazy without affecting other people's deployments).
It has CentOS 7 (Base, Updates, Extras) and CentOS Stream 8 (BaseOS, AppStream) repos synced.
To generate some load, I am using the following, rather simple API script.
It is not using the official bindings so I can later more easily port that over to locust or something.
```python
import urllib.parse

import requests

LIMIT = 2000
BASE_URL = 'https://katello.example.com'
API_URL = f'{BASE_URL}/pulp/api/v3'


class PulpAPI:
    def __init__(self):
        self.session = requests.Session()
        self.session.cert = ('/etc/foreman/client_cert.pem', '/etc/foreman/client_key.pem')
        self.session.verify = True

    def request(self, url):
        r = self.session.get(url)
        r.raise_for_status()
        return r.json()


api = PulpAPI()

# fetch the first 1000 repositories known to Pulp
repositories = api.request(f'{API_URL}/repositories/?limit=1000')

while True:
    for repo in repositories['results']:
        print(repo['name'])
        version = repo['latest_version_href']
        # list all non-source RPM packages of the latest repository version,
        # LIMIT records at a time, following the pagination until exhausted
        query = {'arch__ne': 'src',
                 'fields': 'pulp_href,name,version,release,arch,epoch,summary,is_modular,rpm_sourcerpm,location_href,pkgId',
                 'limit': LIMIT, 'offset': 0, 'repository_version': version}
        query_s = urllib.parse.urlencode(query)
        url = f'{API_URL}/content/rpm/packages/?{query_s}'
        while url:
            x = api.request(url)
            url = x['next']
```
TL;DR for those who do not want to read Python: it fetches the first 1000 repositories known to Pulp and then fetches details about all non-source RPM packages in those repos until interrupted.
It’s probably broken for setups that have non-RPM repositories, but I don’t care for now.
The API is configured with `--max-requests 50 --max-requests-jitter 30`.
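For reference, the same recycling behaviour could also be expressed in a gunicorn configuration file; the following is just a sketch of the equivalent settings, the actual deployment passes them on the command line:

```python
# gunicorn.conf.py -- sketch of the equivalent of the command line flags above
max_requests = 50         # recycle a worker after it has served this many requests
max_requests_jitter = 30  # plus a random jitter of up to 30 requests, so that
                          # not all workers restart at the same time
```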
With 5 API workers (Katello default for that size of a VM), running this script for ~6 minutes makes the API memory consumption jump to 26GB in a total of 150 requests.
150 requests across 5 workers is 30 requests per worker, which means no restarts happened (the restart threshold is 50 requests plus a random jitter of up to 30).
Shortly after that, the kernel started killing gunicorn because of OOM. Well, let’s define 26G as the maximum for now,
but keep in mind that this box is totally idle otherwise and you wouldn’t have that much memory available when Katello is actually processing the returned data.
I’ve then started experimenting with the number of workers (alternating 5 and 3) and the number of requested records per call (`LIMIT`: 2000, 1000, 500).
Katello uses 2000 by default, as fetching things in multiple smaller requests takes more time (see https://github.com/Katello/katello/pull/7862).
Interestingly, even with 3 workers, it still managed to get to 26G quite quickly, without any restarts.
The best result was with 3 workers and `LIMIT=500` – the restarts managed to keep the usage below 9GB.
However, that’s still a lot (and would probably still have resulted in OOM on a box that has an active Katello),
and it’s also not really a fix, as the restarts are triggered by the number of requests served, not by the memory size of the worker.
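To put a rough number on why request-count based recycling cannot bound the memory, here is a back-of-the-envelope sketch; the per-request growth is only an average derived from the 26GB/150-requests run above, not a measured per-request value:

```python
# Rough upper bound on how much a single worker can grow before --max-requests
# recycles it. ~175 MB/request is an average derived from the run above
# (26 GB across 150 requests), not a measured per-request value.
growth_per_request_mb = 26 * 1024 / 150  # ~177 MB of RSS growth per request
worst_case_requests = 50 + 30            # --max-requests plus the full jitter
print(f"worst case growth per worker before restart: "
      f"{worst_case_requests * growth_per_request_mb / 1024:.1f} GB")  # ~13.9 GB
```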
Looking at https://github.com/benoitc/gunicorn/issues/1299 and https://github.com/benoitc/gunicorn/blob/master/examples/when_ready.conf.py,
I decided to try actually looking at the memory footprint and to kill workers “on demand”.
```python
# based on https://github.com/benoitc/gunicorn/blob/master/examples/when_ready.conf.py
import signal
import threading
import time

import psutil

max_mem = 2048  # per-worker RSS limit in MB


class MemoryWatch(threading.Thread):
    def __init__(self, server, max_mem):
        super().__init__()
        self.daemon = True
        self.server = server
        self.max_mem = max_mem
        # check every half gunicorn timeout
        self.timeout = server.timeout / 2

    def memory_usage(self, pid):
        # RSS of the given worker process in MB
        return psutil.Process(pid).memory_info().rss / (1024 * 1024)

    def run(self):
        while True:
            for (pid, worker) in list(self.server.WORKERS.items()):
                if self.memory_usage(pid) > self.max_mem:
                    self.server.log.info(f"Killing worker {pid} (memory usage > {self.max_mem}MB).")
                    self.server.kill_worker(pid, signal.SIGTERM)
            time.sleep(self.timeout)


def when_ready(server):
    server.log.info(f"Starting MemoryWatch with {max_mem}MB limit per worker")
    mw = MemoryWatch(server, max_mem)
    mw.start()
```
This is using `psutil` instead of shelling out to `ps` like the example does, as it felt more natural,
but otherwise the idea is the same: have a watcher thread that checks the memory usage of the workers
and kills (via SIGTERM) any worker whose usage is over a threshold (I picked 2GB rather randomly).
The check runs every 45 seconds (half the gunicorn timeout), as Katello by default deploys with `--timeout 90`.
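For completeness: `when_ready` is one of gunicorn’s server hooks, so the snippet above just needs to live in the configuration file gunicorn is started with. And for anyone who wants to poke at the `psutil` call interactively, it boils down to the following standalone sketch (measuring the current process instead of a gunicorn worker):

```python
# Standalone illustration of the psutil call used in the hook above;
# it reports the RSS of the current process instead of a gunicorn worker.
import os

import psutil

rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
print(f"RSS of PID {os.getpid()}: {rss_mb:.1f} MB")
```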
With 3 workers and `LIMIT=2000`, the memory watcher keeps the usage at ~6GB, which makes sense as workers are getting killed once they reach 2GB.
With 5 workers and `LIMIT=2000`, it keeps the usage at ~9GB, which is close enough to 2GB × 5 = 10GB.
Even better, that’s close to the numbers we’ve seen with 3 workers and `LIMIT=500` above, while processing bigger requests.
Is that the silver bullet and are the numbers final? Certainly not.
I will repeat the experiment, maybe even with some parallelization (as Katello does that too) and with more variation in the number of workers, batch sizes and memory limits.
But it’s certainly a direction that can give us a more stable API service while we figure out how to fix the underlying memory leak.