How Does Pulp Handle Dropped DB Connections During Long-Running Syncs?

We’re using Pulp (with the pulp_deb plugin) to sync multiple large Ubuntu repositories. These sync operations can take a very long time—in some cases, potentially over 6 hours.

In our environment, the PostgreSQL database is accessed through an HAProxy layer that enforces a 6-hour timeout on idle connections. We’ve observed that if a sync is still in progress when this timeout is hit, and the database connection is dropped, the sync operation fails and does not recover.

A few questions:

  1. Does Pulp have any built-in support for reconnecting or retrying database connections during long-running tasks (e.g., during content synchronization)?
  2. Is there any guidance on how Pulp handles DB-level interruptions mid-task?
  3. Are there best practices or recommended configurations for environments where long-running jobs may exceed typical proxy idle timeouts?

From what we can tell, the Pulp worker maintains the connection during these operations, but it’s unclear whether there’s enough activity to prevent the connection from being marked idle by the proxy.

Any insights or recommendations would be greatly appreciated.

The answer here depends on the version of pulpcore used.
But in general, I suspect the worker does not recycle the database connection. That said, every task is executed in a separate process, and that process starts with a new connection.
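Just to illustrate the kind of resilience being discussed: in Django terms, a reconnect-and-retry wrapper around a database-touching step could look roughly like the sketch below. This is not something pulpcore does today, and real task code would also have to cope with transactions that were rolled back, which is part of what makes this complex. The helper name and retry parameters are made up.

# Sketch only: NOT current pulpcore behavior. Shows what reconnect-and-retry
# around a DB-touching step could look like in Django terms.
import time

from django.db import close_old_connections
from django.db.utils import InterfaceError, OperationalError

def run_with_db_retry(step, attempts=3, delay=5):
    """Run step(), reopening the DB connection and retrying if it was dropped."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except (OperationalError, InterfaceError):
            if attempt == attempts:
                raise
            # Discard the broken connection so Django opens a fresh one on the
            # next query, then wait a bit before retrying.
            close_old_connections()
            time.sleep(delay)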

Thanks for the clarification.

From what we’ve observed in our environment (running pulpcore 3.85.1), if the database connection is dropped mid-task (e.g., due to a proxy timeout), the task appears to fail and terminate without any automatic retry or recovery.

Given that, we’d like to know:

  • Is there any plan or roadmap item to improve resilience in this area—specifically handling unexpected DB connection drops during task execution?
  • Would support for automatic reconnection or retry on failure be considered in future versions?

We understand this might be complex due to how tasks are managed in separate worker processes, but better resilience here would be very valuable for environments with long-running sync jobs behind proxies.

Thanks again for your guidance!

What I can say is that there is no such plan at the moment, but we are currently implementing changes that will make talking about such a plan possible.

Which repository are you syncing and with what remote settings?
A possible workaround would be to split the large repository into smaller ones, perhaps by distribution. We also use HAProxy in front of the database, with a timeout set to 1 hour.
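If it helps, splitting by distribution would boil down to one remote and one repository per distribution. Against the standard REST endpoints that could look roughly like the sketch below; the base URL, credentials, and distribution names are placeholders.

# Sketch: one AptRemote and one repository per distribution via the REST API.
import requests

PULP = "https://pulp.example.com"
AUTH = ("admin", "password")

for dist in ("focal", "jammy", "noble"):
    requests.post(
        f"{PULP}/pulp/api/v3/remotes/deb/apt/",
        auth=AUTH,
        json={
            "name": f"ubuntu-{dist}",
            "url": "http://archive.ubuntu.com/ubuntu/",
            "distributions": dist,
        },
    ).raise_for_status()
    requests.post(
        f"{PULP}/pulp/api/v3/repositories/deb/apt/",
        auth=AUTH,
        json={"name": f"ubuntu-{dist}"},
    ).raise_for_status()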


Thanks for the suggestion.

In our case, we need to maintain the same repository structure as upstream because of limitations in our client setup. This was discussed here: Matching Ubuntu repository directory structure in pulp_deb.

Because of that requirement, we end up syncing a large number of distributions into a single repository, which is what seems to trigger the long-running operations and eventual timeout. Unfortunately, splitting the repository by distribution isn’t an option for us due to how clients consume the content.

Do you see any other possible workarounds in this case, or would this scenario fall into the category of improvements that might be addressed by the ongoing pulpcore changes mentioned earlier?

It is a tough combination of requirements, and I can’t help with the questions about re-connecting to a severed DB connection. I am used to monolithic installations that don’t run into this issue.

However, I will mention a possible workaround that you already mentioned in the other thread: Using separate remotes for different distributions and then syncing multiple remotes into a single Pulp repository using policy “additive”. I don’t have a lot of experience doing this, but it should work in principle.
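Concretely, that would be one sync call per remote against the same Pulp repository. A rough sketch, assuming per-distribution remotes already exist and that your pulp_deb version exposes the additive behavior as mirror=false on the sync call; the hrefs are placeholders.

# Sketch: syncing several per-distribution remotes into one repository.
# mirror=False keeps previously synced content, i.e. additive behavior.
import requests

PULP = "https://pulp.example.com"
AUTH = ("admin", "password")

repo_href = "/pulp/api/v3/repositories/deb/apt/<combined-repo-uuid>/"
remote_hrefs = [
    "/pulp/api/v3/remotes/deb/apt/<focal-remote-uuid>/",
    "/pulp/api/v3/remotes/deb/apt/<jammy-remote-uuid>/",
]

for remote_href in remote_hrefs:
    response = requests.post(
        f"{PULP}{repo_href}sync/",
        auth=AUTH,
        json={"remote": remote_href, "mirror": False},
    )
    response.raise_for_status()
    print("sync task:", response.json()["task"])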

A variant on this approach would be to sync individual APT repo distributions into different Pulp repositories, and then copy from all of those sync repositories into another Pulp repository to combine things. You would then publish and distribute just that combined repository to your users. Again, this should work in principle, but I don’t have much experience where the pitfalls might lie.

Does anyone know if there is a simple way to copy all content from one repository version to some target repository? There are these docs: Advanced Copy - Pulp Project, but they are about copying a list of content units from one repo to another, which would be really cumbersome for this use case.
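Short of a dedicated "copy everything" call, one way to script the list-based approach is to page through the content of the source repository version and pass all the hrefs to the target repository's modify/ endpoint. A rough sketch, with pagination and error handling kept minimal and placeholder hrefs:

# Sketch: collect every content href in a source repository version, then add
# them all to a target repository in a single modify/ call (new version).
import requests

PULP = "https://pulp.example.com"
AUTH = ("admin", "password")

source_version_href = "/pulp/api/v3/repositories/deb/apt/<src-uuid>/versions/42/"
target_repo_href = "/pulp/api/v3/repositories/deb/apt/<dst-uuid>/"

hrefs = []
url = f"{PULP}/pulp/api/v3/content/"
params = {"repository_version": source_version_href, "limit": 1000}
while url:
    page = requests.get(url, auth=AUTH, params=params)
    page.raise_for_status()
    data = page.json()
    hrefs.extend(item["pulp_href"] for item in data["results"])
    url = data["next"]
    params = None  # the "next" URL already carries the query string

response = requests.post(
    f"{PULP}{target_repo_href}modify/",
    auth=AUTH,
    json={"add_content_units": hrefs},
)
response.raise_for_status()
print("modify task:", response.json()["task"])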

If automatically restarting a failed task would suit you: it seems squeezer restarts failed tasks that end with the error “Task process died unexpectedly with exitcode 1.” You’re getting this error, right? I found a task with a similar error.

"pulp_href": "/pulp/api/v3/tasks/0199211d-fa27-7a0c-a6cd-f1a42ed98f13/",
"pulp_created": "2025-09-06T22:20:36.776545Z",
"state": "failed",
"name": "pulp_deb.app.tasks.synchronizing.synchronize",
"logging_cid": "5f75ef60dc624c10915772edaaea4d14",
"created_by": "/pulp/api/v3/users/2/",
"started_at": "2025-09-06T22:20:36.836200Z",
"finished_at": "2025-09-06T23:44:21.924892Z",
"error": {
  "reason": "Task process died unexpectedly with exitcode 1."
},
"worker": "/pulp/api/v3/workers/0198be62-d85d-77e6-a373-bd9e8831df98/",
"parent_task": null,
"child_tasks": [],
"task_group": null,
"progress_reports": [
  {
    "message": "Update ReleaseFile units",
    "code": "update.release_file",
    "state": "completed",
    "total": null,
    "done": 1,
    "suffix": null
  },
  {
    "message": "Update PackageIndex units",
    "code": "update.packageindex",
    "state": "completed",
    "total": null,
    "done": 8,
    "suffix": null
  },
  {
    "message": "Associating Content",
    "code": "associating.content",
    "state": "completed",
    "total": null,
    "done": 2292,
    "suffix": null
  },
  {
    "message": "Downloading Artifacts",
    "code": "sync.downloading.artifacts",
    "state": "completed",
    "total": null,
    "done": 932,
    "suffix": null

The next task started 4 seconds later; it doesn’t look like it was started manually that quickly :slight_smile:

"pulp_href": "/pulp/api/v3/tasks/0199216a-b70f-75e7-af09-b48fa57027ae/",
"pulp_created": "2025-09-06T23:44:25.917307Z",
"state": "completed",
"name": "pulp_deb.app.tasks.synchronizing.synchronize",
"logging_cid": "3c7735562b6a49b292a2dd99873e298a",
"created_by": "/pulp/api/v3/users/2/",
"started_at": "2025-09-06T23:44:25.993128Z",
"finished_at": "2025-09-07T00:35:03.224056Z",
"error": null,
"worker": "/pulp/api/v3/workers/0198be61-d306-7e25-8c39-723ae6bc7dcd/",
"parent_task": null,
"child_tasks": [],
"task_group": null,
"progress_reports": [
  {
    "message": "Downloading Artifacts",
    "code": "sync.downloading.artifacts",
    "state": "completed",
    "total": null,
    "done": 3,
    "suffix": null
  },
  {
    "message": "Update ReleaseFile units",
    "code": "update.release_file",
    "state": "completed",
    "total": null,
    "done": 1,
    "suffix": null
  },
  {
    "message": "Update PackageIndex units",
    "code": "update.packageindex",
    "state": "completed",
    "total": null,
    "done": 8,
    "suffix": null
  },
  {
    "message": "Associating Content",
    "code": "associating.content",
    "state": "completed",
    "total": null,
    "done": 2292,
    "suffix": null

UPDATE: This isn’t squeezer. These are retries set in an Ansible task.

FAILED - RETRYING: [localhost]: Pulp Debian <|> Sync DEB remotes into repositories (3 retries left).

changed: [localhost] => (item=debian.trixie.trixie)


Hi,
In my environment, connections cross many firewalls that cut connections with no traffic after 15 minutes.
We are using the following parameters to set up keepalives:

  PULP_DATABASES__default__OPTIONS__keepalives: 1
  PULP_DATABASES__default__OPTIONS__keepalives_idle: 900
  PULP_DATABASES__default__OPTIONS__keepalives_interval: 10
  PULP_DATABASES__default__OPTIONS__keepalives_count: 5
  PULP_DATABASES__default__OPTIONS__tcp_user_timeout: 945
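
If I understand the dynaconf mapping correctly, those variables end up in Django’s database settings roughly like this, and the OPTIONS are passed through to the PostgreSQL driver / libpq when each connection is opened:

# Roughly what the environment variables above resolve to in Django settings.
DATABASES = {
    "default": {
        # ... ENGINE, NAME, USER, HOST, etc. as usual ...
        "OPTIONS": {
            "keepalives": 1,
            "keepalives_idle": 900,
            "keepalives_interval": 10,
            "keepalives_count": 5,
            "tcp_user_timeout": 945,
        },
    },
}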

I remember this good documentation from AWS RDS about that topic: Dead connection handling in PostgreSQL - Amazon Relational Database Service

Hope it helps
Cheers
Mike
