Pulp sync fails with worker node gone missing

midhuhk · July 25, 2024, 9:10am

Hi Team,

I am using the Pulp operator on OpenShift and running repository creation/sync with the pulp squeezer module. The sync fails, and the worker pods restart during the sync.

Error logs in the pod:
pulp [4340f90deb9e496e94b9cc2ed63b2bb3]: pulpcore.tasking.tasks:INFO: Starting task 01905e7c-2f1b-7cfc-ad83-8e1496310e32
pulp [4340f90deb9e496e94b9cc2ed63b2bb3]: pulp_rpm.app.tasks.synchronizing:INFO: Synchronizing: repository=test-epel-rhel9 remote=base-epel-rhel9
Exception ignored in: <generator object PulpcoreWorker.iter_tasks at 0x7f94dc1f3350>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 286, in iter_tasks
    break
  File "/usr/local/lib/python3.9/site-packages/pulpcore/app/models/task.py", line 116, in __exit__
    with connection.cursor() as cursor:
  File "/usr/local/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/base/base.py", line 330, in cursor
    return self._cursor()
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/base/base.py", line 308, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/base/base.py", line 308, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/usr/local/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/postgresql/base.py", line 330, in create_cursor
    cursor = self.connection.cursor()
  File "/usr/local/lib/python3.9/site-packages/psycopg/connection.py", line 852, in cursor
    self._check_connection_ok()
  File "/usr/local/lib/python3.9/site-packages/psycopg/connection.py", line 485, in _check_connection_ok
    raise e.OperationalError("the connection is closed")
django.db.utils.OperationalError: the connection is closed
Traceback (most recent call last):
  File "/usr/local/bin/pulpcore-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py", line 43, in worker
    PulpcoreWorker().run(burst=burst)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 410, in run
    self.handle_available_tasks()
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 393, in handle_available_tasks
    self.supervise_task(task)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 336, in supervise_task
    connection.connection.execute("SELECT 1")
  File "/usr/local/lib/python3.9/site-packages/psycopg/connection.py", line 891, in execute
    raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

The new pod says:
[be3075@dk1lxjump01 ~]$ oc logs pod/pulp-server-nonprod-worker-684897ffcc-7p87s
Waiting on postgresql to start…
Postgres started.
Checking for database migrations
Database migrated!
pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
pulp [None]: pulpcore.tasking.worker:INFO: Cleaning up task 01905e7c-2f1b-7cfc-ad83-8e1496310e32 and marking as failed. Reason: Worker has gone missing.

Any suggestions?

ggainey · June 28, 2024, 11:39am

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

This says to me the database (“server”, above) dropped the connection mid-transaction. What’s showing in the logs from the DB pod? Any sign of the OOMKiller coming to visit? What’s your sizing on your pods, esp the db-pod?
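
If it helps, here is a rough Python sketch of one way to check for an OOM kill. It is not a Pulp tool. It assumes you pass it the JSON that `oc get pod <name> -o json` prints, and it looks at the standard Kubernetes `status.containerStatuses[].lastState.terminated.reason` field. The helper names `find_oom_kills` and `load_pod` are made up for illustration.

```python
import json
import subprocess


def find_oom_kills(pod: dict) -> list:
    """Return names of containers whose last termination reason was OOMKilled."""
    killed = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty ({}) for a container that has never been restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            killed.append(status["name"])
    return killed


def load_pod(name: str) -> dict:
    """Fetch one pod as JSON through the oc CLI.

    Assumes oc is installed and already logged in to the right project.
    """
    result = subprocess.run(
        ["oc", "get", "pod", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Something like `find_oom_kills(load_pod("<your database pod name>"))` would then return the container names that were OOM-killed on their last restart, or an empty list if none were.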

midhuhk · June 28, 2024, 12:03pm

It looks like the database went into recovery mode. I have used the default sizing for the DB pod.

2024-06-28 10:56:50.316 UTC [54] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2024-06-28 10:56:50.317 UTC [3072] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.415 UTC [3073] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.420 UTC [1] LOG: all server processes terminated; reinitializing
2024-06-28 10:56:50.533 UTC [3074] LOG: database system was interrupted; last known up at 2024-06-28 10:53:20 UTC
2024-06-28 10:56:50.534 UTC [3075] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.534 UTC [3076] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.538 UTC [3077] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.615 UTC [3078] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.615 UTC [3079] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.621 UTC [3080] FATAL: the database system is in recovery mode
2024-06-28 10:56:50.715 UTC [3081] FATAL: the database system is in recovery mode
2024-06-28 10:56:51.021 UTC [3074] LOG: database system was not properly shut down; automatic recovery in progress
2024-06-28 10:56:51.023 UTC [3074] LOG: redo starts at 0/209D780
2024-06-28 10:56:52.891 UTC [3082] FATAL: the database system is in recovery mode

Let me try increasing the resources and get back to you on this. Thank you.
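
For anyone comparing timestamps later, below is a minimal Python sketch that summarizes a PostgreSQL log excerpt like the one above. It assumes the line layout shown in this post (`<timestamp> UTC [<pid>] <LEVEL>: <message>`); other `log_line_prefix` settings would need a different pattern. The function name `summarize_recovery` is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-06-28 10:56:50.317 UTC [3072] FATAL: the database system is in recovery mode
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+UTC\s+"
    r"\[(?P<pid>\d+)\]\s+(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)
LAST_UP_RE = re.compile(r"last known up at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC")


def summarize_recovery(lines):
    """Count rejected connections and pick out the recovery timestamps."""
    summary = {"rejected_connections": 0, "last_known_up": None, "recovery_started": None}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not follow the assumed layout
        msg = match["msg"]
        if match["level"] == "FATAL" and "recovery mode" in msg:
            # A client tried to connect while the server was still recovering.
            summary["rejected_connections"] += 1
        elif "automatic recovery in progress" in msg:
            summary["recovery_started"] = datetime.strptime(
                match["ts"], "%Y-%m-%d %H:%M:%S.%f"
            )
        else:
            last_up = LAST_UP_RE.search(msg)
            if last_up:
                summary["last_known_up"] = datetime.strptime(
                    last_up.group(1), "%Y-%m-%d %H:%M:%S"
                )
    return summary
```

Feeding it the saved output of `oc logs` for the database pod, split into lines, gives timestamps that can be lined up against the worker tracebacks and the pod restart times.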