Stuck Sync Task After Worker Restart (pulpcore 3.90.0)

We’re running pulpcore 3.90.0 and using pulp_deb to sync multiple Ubuntu distributions into a single repository. During one of these long-running sync jobs, the host rebooted and the pulp-worker process was restarted mid-task.

After the reboot, the task remained in a “running” state but showed no progress. We attempted to cancel it, and it has been stuck in the “canceling” state ever since. This prevents us from starting a new sync because the task’s resources remain locked.

Questions:

  1. What’s the recommended way to safely clean up or recover from a stuck “running” or “canceling” task after a worker restart?
  2. Is there a supported way to release the locked resources manually so we can re-run the sync?

As I understand it, the worker process handling the task went away without the chance (or time) to tear down.

Assuming the task is stuck in canceling right now, can you share the output of the following commands?
It would really help us understand what’s going on.

pulp task show --href <task-href>
pulp worker list

When a task’s worker disappears unexpectedly, its database record should be cleaned up by the other workers on a regular basis. Cleaning up the missing worker’s record releases the task lock and lets another worker pick up the stale task and perform a proper teardown. Calling cancel in the API changes the state to canceling, but a worker must still pick the task up to do the cleanup.
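
To make the mechanism concrete, here is a toy model of heartbeat-based staleness detection. It is not pulpcore's code: the `ASSUMED_TTL` value, the helper names, and the sample worker records are all made up for illustration. It only shows the idea that a worker whose last heartbeat is too old is treated as missing and becomes eligible for cleanup.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical time-to-live; the real threshold depends on the deployment.
ASSUMED_TTL = timedelta(seconds=30)


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z', as the pulp CLI prints it."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def missing_workers(workers, now, ttl=ASSUMED_TTL):
    """Return the names of workers whose last heartbeat is older than ttl."""
    return [w["name"] for w in workers if now - parse_ts(w["last_heartbeat"]) > ttl]


# Made-up records: worker-a stopped sending heartbeats, worker-b is alive.
workers = [
    {"name": "worker-a", "last_heartbeat": "2025-10-08T01:51:00.000000Z"},
    {"name": "worker-b", "last_heartbeat": "2025-10-08T02:00:00.000000Z"},
]
now = datetime(2025, 10, 8, 2, 0, 10, tzinfo=timezone.utc)

print(missing_workers(workers, now))
```

In this model only `worker-a` is reported as missing. If the record of the dead worker is never classified that way, or never removed, nothing releases the task it held.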

So I suspect there might be a problem with the worker record cleanup.

You can try the following. As @x9c4 has already clarified in other circumstances, this is safe because we are strict about when a worker counts as missing:

pulpcore-manager shell -c "from pulpcore.app.models import AppStatus; print(AppStatus.objects.missing().delete())"

I have the same problem on core 3.90.0

github pulpcore issue #7012.

I cannot paste links yet.


Thanks for the explanation — that makes sense.

I ran the suggested commands and here are the results:

$ pulp task show --href=/pulp/api/v3/tasks/0199c184-0e91-7260-92ee-1c8e0d0b30c7/
{
  "pulp_href": "/pulp/api/v3/tasks/0199c184-0e91-7260-92ee-1c8e0d0b30c7/",
  "prn": "prn:core.task:0199c184-0e91-7260-92ee-1c8e0d0b30c7",
  "pulp_created": "2025-10-08T01:51:21.242543Z",
  "pulp_last_updated": "2025-10-08T01:51:21.234334Z",
  "state": "canceling",
  "name": "pulp_deb.app.tasks.synchronizing.synchronize",
  "logging_cid": "e635d6f2c0474424a84293f4d4d962dd",
  "created_by": "/pulp/api/v3/users/1/",
  "unblocked_at": "2025-10-08T01:51:21.300100Z",
  "started_at": "2025-10-08T01:51:21.373350Z",
  "finished_at": null,
  "error": null,
  "worker": null,
  "parent_task": null,
  "child_tasks": [],
  "task_group": null,
  "progress_reports": [
    {
      "message": "Update ReleaseFile units",
      "code": "update.release_file",
      "state": "running",
      "total": null,
      "done": 21,
      "suffix": null
    },
    {
      "message": "Associating Content",
      "code": "associating.content",
      "state": "running",
      "total": null,
      "done": 839372,
      "suffix": null
    },
    {
      "message": "Downloading Artifacts",
      "code": "sync.downloading.artifacts",
      "state": "running",
      "total": null,
      "done": 370985,
      "suffix": null
    },
    {
      "message": "Update PackageIndex units",
      "code": "update.packageindex",
      "state": "running",
      "total": null,
      "done": 240,
      "suffix": null
    }
  ],
  "created_resources": [
    "<unavailable>"
  ],
  "reserved_resources_record": [
    "prn:deb.aptrepository:0199c182-f9bf-7cd6-877d-d015a95b71cc",
    "shared:prn:deb.aptremote:0199c183-c5f0-7721-8ca7-e55fa47e4775",
    "shared:prn:core.domain:8089ccfe-0950-4f54-8c63-465f7dd33873"
  ],
  "result": null
}
$ pulp worker list
[
  {
    "pulp_href": "/pulp/api/v3/workers/0199c97e-3066-781f-9ced-9d4bd2f8e089/",
    "prn": "prn:core.appstatus:0199c97e-3066-781f-9ced-9d4bd2f8e089",
    "pulp_created": "2025-10-09T15:01:54.408525Z",
    "pulp_last_updated": "2025-10-09T15:01:54.408540Z",
    "name": "1@9b8b7d17e00c",
    "last_heartbeat": "2025-10-13T16:44:09.136460Z",
    "versions": {
      "deb": "3.7.0",
      "rpm": "3.32.0",
      "core": "3.90.0",
      "file": "3.90.0",
      "ostree": "2.5.0",
      "certguard": "3.90.0"
    },
    "current_task": null
  }
]
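
Read together, the two outputs show a task that is still `canceling` with `finished_at` unset, while the only listed worker reports no `current_task`. Below is a small sketch of that check, using a trimmed copy of the JSON above. The helper names are hypothetical, and treating entries without the `shared:` prefix as exclusive reservations is an assumption about how the record should be read.

```python
import json

ACTIVE_STATES = {"running", "canceling"}


def looks_orphaned(task, workers):
    """True if the task is unfinished but no listed worker claims it."""
    claimed = any(w.get("current_task") == task["pulp_href"] for w in workers)
    return (
        task["state"] in ACTIVE_STATES
        and task["finished_at"] is None
        and not claimed
    )


def exclusive_reservations(task):
    """Entries without the 'shared:' prefix (assumed here to be exclusive)."""
    return [r for r in task["reserved_resources_record"] if not r.startswith("shared:")]


# Trimmed copies of the `pulp task show` and `pulp worker list` output.
task = json.loads("""
{
  "pulp_href": "/pulp/api/v3/tasks/0199c184-0e91-7260-92ee-1c8e0d0b30c7/",
  "state": "canceling",
  "finished_at": null,
  "worker": null,
  "reserved_resources_record": [
    "prn:deb.aptrepository:0199c182-f9bf-7cd6-877d-d015a95b71cc",
    "shared:prn:deb.aptremote:0199c183-c5f0-7721-8ca7-e55fa47e4775",
    "shared:prn:core.domain:8089ccfe-0950-4f54-8c63-465f7dd33873"
  ]
}
""")
workers = json.loads('[{"name": "1@9b8b7d17e00c", "current_task": null}]')

print("orphaned:", looks_orphaned(task, workers))
print("exclusive:", exclusive_reservations(task))
```

With this data the task is flagged as orphaned, and the single non-shared entry is the `deb.aptrepository` resource, which matches the symptom that a new sync of that repository cannot start.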

I also executed the following command as suggested:

$ podman run --name clean  pulp-minimal:3.90.0 pulpcore-manager shell -c "from pulpcore.app.models import AppStatus; print(AppStatus.objects.missing().delete())"
(0, {})

The task still remains in the “canceling” state and is not being released or retried.

At this point, it looks like the cleanup process didn’t trigger a proper teardown.
Is there a safe way to manually mark this task as failed or force-release its resources so we can re-run the sync job?

Thanks again for the help!

@PotentialIngenuity said the task was released for him here, but that could have been luck, and the bug may still be there.

In any case, we have made some changes related to tasking bugs since 3.90.0, so it would be great if you could test with the latest version. For example, one of the fixes addresses a scenario with only one worker, which looks like your case here.

Maybe @gerrod, the snippets from your other post can help here too.

I’m out of very specific ideas here. Can you share the logs for the task’s correlation id (logging_cid) so we can see if we can spot something?
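
One way to narrow the logs down is a plain substring filter on the correlation id. The sketch below assumes the id appears verbatim in each relevant log line. The function name is hypothetical, and the sample lines are placeholders rather than real Pulp log output.

```python
CID = "e635d6f2c0474424a84293f4d4d962dd"  # logging_cid from the task above


def lines_with_cid(lines, cid):
    """Return only the lines that contain the correlation id."""
    return [line for line in lines if cid in line]


# Placeholder lines, only to show the filtering; not real log output.
sample = [
    "placeholder line A [e635d6f2c0474424a84293f4d4d962dd] something happened\n",
    "placeholder line B [some-other-cid] unrelated\n",
]

for line in lines_with_cid(sample, CID):
    print(line, end="")
```

To use it on real logs, pass `sys.stdin` (or an open log file) as `lines` and pipe the container or service log output into the script.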
