Pulp openshift operator worker nodes stuck in DB migration when db pod is restarted

Hi,

As part of a PoC with the OpenShift Pulp operator, when the db pod is deleted in a running instance, all other pods are stuck with the message below:

Database migration in progress. Waiting…
Database migration in progress. Waiting…

The DB pod logs the following:
"core_contentappstatus"."name", "core_contentappstatus"."last_heartbeat", "core_contentappstatus"."versions" FROM "core_contentappstatus" WHERE "core_contentappstatus"."name" = '28@pulp-server-nonprod-content-fd7f88975-xh54b' LIMIT 21
[similar errors repeated]
Any suggestions?

Hi @midhuhk,

“[…] When db pod is deleted in a running instance […]”

Was this PoC environment deployed with an ephemeral db pod, or was the database volume also deleted?
If so, all of the data was lost when the postgres pod/volume was deleted.
In that case, you can recreate the database schema by running the Django migrations as follows:

  • get the name of a content or api pod, for example:
$ oc get pod -l app=pulp-content -oname
  • open a remote shell session into the init container:
$ oc rsh -cinit-container pod/pulp-content-58787cdc8f-p8znk
  • run the migration
sh-5.1$ pulpcore-manager migrate
error: Failed to initialize NSS library <----- you can ignore this error message
Operations to perform:
  Apply all migrations: ansible, auth, certguard, container, contenttypes, core, deb, file, maven, ostree, python, rpm, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying core.0001_initial... OK
  Applying core.0002_increase_artifact_size_field... OK
  Applying core.0003_remove_upload_completed... OK
  Applying core.0004_add_duplicated_reserved_resources... OK
Access policy for distributions/file/file created.
Access policy for publications/file/file created.
Access policy for remotes/file/file created.
Access policy for repositories/file/file/versions created.
Access policy for repositories/file/file created.

After that, check if all pods are in a Running state again:

$ oc get pods
NAME                                                READY   STATUS    RESTARTS      AGE
pulp-api-84d8fcffbf-lj72p                           1/1     Running   7 (56s ago)   12d
pulp-content-58787cdc8f-p8znk                       1/1     Running   0             15m
pulp-database-0                                     1/1     Running   0             15m
pulp-operator-controller-manager-56d994b744-vjlk9   2/2     Running   2 (11d ago)   12d
pulp-worker-795ddc6569-q4zxx                        1/1     Running   3 (15m ago)   12d

Hi, a PVC is attached to the db pod and I see it is still mounted in the api/content pods, so I guess the data is still there.

I cannot rsh into the init containers:
[be3075@dk1lxjump01 ~]$ oc logs pod/pulp-server-nonprod-content-559f4bf567-5rq79
Waiting on postgresql to start…
Postgres started.
Checking for database migrations
Database migration in progress. Waiting…
Database migration in progress. Waiting…
Database migration in progress. Waiting…
Database migration in progress. Waiting…

[be3075@dk1lxjump01 ~]$ oc rsh -cinit-container pod/pulp-server-nonprod-content-559f4bf567-5rq79
error: unable to upgrade connection: container not found ("init-container")

Current pod status
[be3075@dk1lxjump01 ~]$ oc get pods
pulp-operator-controller-manager-699d9c7876-s8zf2 2/2 Running 0 11h
pulp-server-nonprod-api-664b4b8ff5-9grzr 0/1 Init:0/1 0 11h
pulp-server-nonprod-api-664b4b8ff5-rbr5s 0/1 Running 116 (47s ago) 11h
pulp-server-nonprod-content-559f4bf567-5rq79 0/1 Running 0 11h
pulp-server-nonprod-content-559f4bf567-x9hzl 1/1 Running 0 11h
pulp-server-nonprod-database-0 1/1 Running 0 6m44s
pulp-server-nonprod-redis-7448d64656-kcw9m 1/1 Running 0 11h
pulp-server-nonprod-worker-6df9cddfdb-4swrb 1/1 Running 0 11h
pulp-server-nonprod-worker-6df9cddfdb-x9w9v 1/1 Running 1 (6m33s ago) 11h

Can the migration be executed from a pod directly? Also, will data be corrupted or lost if the migration is run manually from inside a pod?

Hum… this is strange.
From your first message, I thought that all of your pods were waiting for db migration:

“When db pod is deleted in a running instance , all other pods are stuck with below […]”

but checking your new message, I can see that there are some in a Running state already (meaning they already checked for pending migrations and started the application process):

pulp-server-nonprod-content-559f4bf567-x9hzl 1/1 Running 0 11h
pulp-server-nonprod-worker-6df9cddfdb-4swrb 1/1 Running 0 11h
pulp-server-nonprod-worker-6df9cddfdb-x9w9v 1/1 Running 1 (6m33s ago) 11h

I guess the ones that are not running yet are the ones you have reprovisioned, right?
Even so, it is still strange, because there is a pod in init state (probably pending migrations):

pulp-server-nonprod-api-664b4b8ff5-9grzr 0/1 Init:0/1 0 11h

and 2 running, but not READY yet (readiness probe failing?):

pulp-server-nonprod-api-664b4b8ff5-rbr5s 0/1 Running 116 (47s ago) 11h
pulp-server-nonprod-content-559f4bf567-5rq79 0/1 Running 0 11h

Maybe we have different problems here.
Can you provide must-gather or oc adm inspect output, so we can get more information about this?

Also, can you please check if all pulpcore pods (api, content and worker) have the same pulp-minimal image:

oc get pods -ojson| jq -r '.items[]|[.metadata.name,.status.containerStatuses[0].imageID]|@tsv'
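
A rough way to turn that check into a yes/no answer is to collapse the image IDs into a distinct list. The sketch below assumes the pods are passed in pod/<name> form (as printed by oc get pods -o name); distinct_pod_images is a hypothetical helper name, not something shipped with oc or the operator.

```shell
# Hypothetical helper: print the distinct image IDs used by the given pods,
# one per line. Only the first container of each pod is inspected, like the
# jq command above. More than one output line means the images differ.
distinct_pod_images() (
  for pod in "$@"; do
    oc get "$pod" -o jsonpath='{.status.containerStatuses[0].imageID}{"\n"}'
  done | sort -u
)
```

For example, distinct_pod_images $(oc get pods -o name | grep -E -- '-(api|content|worker)-') should print a single line if all pulpcore pods run the same image. The grep pattern is an assumption based on the pod names in this thread.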

Hi,

Even though the pods show as Running, the logs say they are waiting for the migration.

[be3075@dk1lxjump01 ~]$ oc get pods |grep -i worker
pulp-server-nonprod-worker-6df9cddfdb-4swrb 1/1 Running 0 18h
pulp-server-nonprod-worker-6df9cddfdb-x9w9v 1/1 Running 2 (99m ago) 18h
[be3075@dk1lxjump01 ~]$ oc logs pod/pulp-server-nonprod-worker-6df9cddfdb-4swrb |more
Waiting on postgresql to start…
Postgres started.
Checking for database migrations
Database migration in progress. Waiting…
Database migration in progress. Waiting…

The init container checks for /usr/local/bin/pulpcore-manager showmigrations | grep '[ ]' and the output is as below:
sh-5.1$ /usr/local/bin/pulpcore-manager showmigrations | grep '[ ]'
[ ] 0001_initial
[ ] 0002_advanced_collections
[ ] 0003_add_tags_and_collectionversion_fields
[ ] 0004_add_fulltext_search_indexes
[ ] 0005_collectionversion_is_highest
[ ] 0006_remove_whitelist_and_alter_collection_version_name
[ ] 0007_collectionversion_is_certified
[ ] 0008_collectionremote_requirements_file
[ ] 0009_collectionimport
[ ] 0010_ansible_related_names
[ ] 0011_collectionimport
[ ] 0012_auto_20190906_2253
[ ] 0013_pulp_fields
sh-5.1$ /usr/local/bin/pulpcore-manager showmigrations | grep '[ ]'

Both outputs are from the 2 content pods.

Considering that both outputs are the same, I’m discarding the possibility of the pods connecting to different dbs.
It seems like no migrations ran at all.
Can you please check whether the following command returns any error?

oc exec pulp-worker-74fcf5dcb9-bdthr -- /usr/local/bin/pulpcore-manager migrate
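
To compare the state before and after running migrate, the number of unapplied migrations can be counted from the showmigrations output. This is only a minimal sketch, and count_pending_migrations is a hypothetical helper name. It uses grep -F because an unescaped '[ ]' is a bracket expression that matches any line containing a space, so it would also count applied '[X]' entries.

```shell
# Hypothetical helper: count unapplied migrations in the output of
# `pulpcore-manager showmigrations`, read from stdin.
# -F treats "[ ]" as a literal string; -c prints the number of matching lines.
# `|| true` keeps the exit status 0 when the count is 0.
count_pending_migrations() {
  grep -cF '[ ]' || true
}
```

Usage, with the pod name as a placeholder: oc exec <pod> -- /usr/local/bin/pulpcore-manager showmigrations | count_pending_migrations. A result of 0 means there is nothing left to apply.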

I have executed that in one of my clusters. The pods are up, but all repositories (remote and local) are empty after that. This specific cluster has synced multiple Red Hat repos, and I would like to keep the data.

Considering the number of pending migrations, it seems like when you deleted the db pod all your data was also lost.

Can you provide us the output of the following commands in https://pastebin.centos.org?

oc get pod pulp-server-nonprod-database-0 -oyaml
oc exec pulp-server-nonprod-database-0 -- mount|grep postgres
oc exec pulp-server-nonprod-database-0 -- sh -c 'echo $PGDATA'
oc exec pulp-server-nonprod-database-0 -- sh -c 'ls $PGDATA'

Updated the details - pulp-openshift-operator-worker-nodes-stuck-in-db-m - Pastebin Service

Disabling the init container that runs the database migration won't help, as the pods show the error below.

hum… there is no PVC attached for the postgres data (it is using the emptyDir volume):

    - mountPath: /var/lib/postgresql/data
      name: pulp-server-nonprod-postgres
  - emptyDir: {}
    name: pulp-server-nonprod-postgres

That is why all of its data was lost when you deleted the pod, and why the migrations had to be run again.

Can you share your Pulp CR (in pastebin, please), so we can see how the operator is configured to deploy the db?

oc get pulp -oyaml

Here is a doc on how to configure pulp operator to deploy a postgres instance with PVC: Pulp Operator storage configuration - Pulp Project
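
The emptyDir finding above can also be checked from the command line without reading the whole pod YAML. The sketch below is a rough check under stated assumptions: pod_pvc_claims and pod_has_pvc are hypothetical helper names, and the check only tells you whether the pod references any PVC at all, not which mount path the claim backs.

```shell
# Hypothetical helper: print the PVC claim names referenced by a pod's volumes.
# Empty output means no volume of the pod is backed by a PVC.
pod_pvc_claims() {
  oc get pod "$1" -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'
}

# Hypothetical helper: succeed only if the pod references at least one PVC.
pod_has_pvc() {
  [ -n "$(pod_pvc_claims "$1")" ]
}
```

For example: pod_has_pvc pulp-server-nonprod-database-0 || echo "database pod has no PVC, data will not survive a pod deletion".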

Updated in pastebin - Pulp operator database issue - Pastebin Service

ok, I can see that you provided a StorageClass for your pulpcore pods:

    file_storage_access_mode: ReadWriteMany
    file_storage_size: 50Gi
    file_storage_storage_class: ocs-storagecluster-cephfs

but there is no definition of storage for the database pods:

    database:
      resource_requirements:
        limits:
          cpu: 4
          memory: 4Gi
        requests:
          cpu: 250m
          memory: 256Mi

to use a StorageClass for the db pods you need to provide the database.postgres_storage_class field, for example:

    database:
      postgres_storage_class: <name of the storage class>
      resource_requirements:
        limits:
          cpu: 4
          memory: 4Gi
        requests:
          cpu: 250m
          memory: 256Mi

Thank you for the findings.
postgres_storage_class was removed because it threw the error below:

2024-07-05T16:57:43Z INFO repo_manager/database.go:104 The pulp-server-nonprod-database StatefulSet has been modified! Reconciling …
2024-07-05T16:57:43Z ERROR repo_manager/database.go:113 Error trying to update the pulp-server-nonprod-database StatefulSet object … {"error": "StatefulSet.apps \"pulp-server-nonprod-database\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden"}

However, setting pvc for the database (and removing postgres_storage_class) looks fine at the moment.

Current config:
cpu: 4
memory: 4Gi
cpu: 250m
memory: 256Mi
pvc: pulp-server-nonprod-file-storage

Any suggestions?

This error means that the operator tried to update the volumeClaimTemplates field of the StatefulSet, but this field is immutable.
To work around this error, you can configure your Pulp CR with database.postgres_storage_class again and then delete the pulp-server-nonprod-database StatefulSet (your running database pod will be removed). The operator should then recreate the StatefulSet with the expected StorageClass.
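
The two steps of that workaround could be scripted roughly as follows. This is a sketch, not an official procedure: recreate_db_statefulset is a hypothetical helper, the Pulp CR name and StorageClass name are arguments you must supply, and the merge patch assumes postgres_storage_class sits under spec.database as described above. Since the database pod is removed, anything stored on a non-persistent volume is lost.

```shell
# Hypothetical helper: re-add database.postgres_storage_class to the Pulp CR,
# then delete the database StatefulSet so the operator can recreate it.
# WARNING: deleting the StatefulSet removes the running database pod.
recreate_db_statefulset() {
  cr_name=$1         # name of the Pulp CR (see `oc get pulp`)
  storage_class=$2   # StorageClass to use for the database volume
  sts_name=$3        # name of the database StatefulSet
  oc patch pulp "$cr_name" --type merge \
    -p "{\"spec\":{\"database\":{\"postgres_storage_class\":\"$storage_class\"}}}" &&
    oc delete statefulset "$sts_name"
}
```

For example: recreate_db_statefulset <pulp CR name> <storage class> pulp-server-nonprod-database. Afterwards, oc get pvc should list a claim for the database pod if the operator recreated the StatefulSet as expected.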