Getting intermittent "('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))" with the pulp openshift operator

Hi Team,

I am using the pulp openshift operator and managing the repositories through the route mirrors. I am getting intermittent connection reset errors when running pulp commands or when clients try to download packages. The OpenShift resources look fine and there are no errors in the logs or events.
Out of 10 attempts, I get a connection reset 2-3 times.

Eg:
[root@yd6543:~]# pulp rpm repository list
Error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
[root@yd654:~]# pulp rpm repository list
[
{
"pulp_href": "/pulp/api/v3/repositories/rpm/rpm/0191dff7-4582-7f76-abcf-5889d41655a7/",

Error on clients:

[MIRROR] kernel-modules-5.14.0-427.35.1.el9_4.x86_64.rpm: Curl error (56): Failure when receiving data from the peer for
https://pulp-server-prod-ocp-pulp-2vlx.apps.inf-oip00.dbadmin.danskenet.net/pulp/content/test/rhel9/baseos/Packages/k/kernel-modules-5.14.0-427.35.1.el9_4.x86_64.rpm
[OpenSSL SSL_read: Connection reset by peer, errno 104]

And with the next try it works. Any suggestions or clues about similar issues?
(I have reverted the repo mirrors to the old setup because of these errors.)

Hi @midhuhk

Would you mind sending us the output from the following commands in pastebin.centos.org?

oc get routes -ojson
oc logs --tail 50  --timestamps -l app.kubernetes.io/component=api

Note: don't forget to remove any sensitive information (like route.spec.host or status.ingress[].routerCanonicalHostname) in case it should not be exposed publicly.
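If it helps, one way to strip those fields before pasting (a minimal sketch, assuming jq is available on your workstation):

oc get routes -ojson | jq 'del(.items[].spec.host, .items[].status)'   #### drops the route hosts and the whole status block before sharing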


The issue was observed on the 12th/13th, after which the mirrors were removed.

https://paste.centos.org/view/c7c23310
https://paste.centos.org/view/83264870
https://paste.centos.org/view/963c7114

Thank you for the outputs.

From them, I can see that the API pods seem to be in a healthy state. The kubelet probes were all returning a 200 OK, so it is probably not OCP removing the pods from the endpoint list because they are not in a ready state.
I also noticed that all GET requests to the repositories endpoints returned a 200 OK:

awk '/GET [/]pulp[/]api[/]v3[/]status[/]/ {if ($12!=200){print}}' 83264870
awk '/GET [/]pulp[/]api[/]v3[/]repositories[/]rpm[/]rpm[/]/ {if ($12!=200){print}}' 83264870
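Assuming the status code sits in the 12th field of these access-log lines, as in the commands above, the same check can be broadened to every /pulp/ request in the paste:

awk '/GET [/]pulp[/]/ {if ($12!=200){print}}' 83264870   #### prints any /pulp/ request that did not return 200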

With that in mind, it doesn't seem the connection resets came from the application side (Pulp), so as a next step we can try to isolate the problem by running the same tests from another pod in the same k8s cluster.
For example:

oc run -i -t pulp-test --image=quay.io/fedora/python-310:latest --restart=Never --command -- bash
pip install pulp-cli
pulp config create --base-url http://pulp-api-svc.pulp.svc:24817 --username admin --password password  #### make sure to point to the k8s service address, not the route host
pulp rpm repository list   ###### repeat this step to see if we can reproduce the error
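To get a feel for the failure rate, the list call can also be repeated in a loop from inside the test pod (a minimal sketch, the 20 iterations are arbitrary):

for i in $(seq 1 20); do pulp rpm repository list > /dev/null 2>&1 || echo "attempt $i failed"; done   #### prints the attempts that hit an error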

If the above test succeeds, we can rule out some components (like ovn/ovs, host-to-host communication, etc.) and shift the focus to other points (like the communication between the ocp routers and the pulp pods, between the external LB and the ocp routers, etc.)


Unfortunately this is a production cluster and external connectivity and access to other pods are restricted.
In parallel, I am also checking on the load balancer side.

Curl on the mirrors shows an SSL error intermittently…

[root@yb4400.danskenet.net TEST:~]# curl https://pulp-server-prod/pulp/content/

Index of /pulp/content/

Index of /pulp/content/


prod/
syst/
test/

[root@yb4400.danskenet.net TEST:~]# curl https://pulp-server-prod/pulp/content/
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

Unfortunately this is a production cluster and external connectivity and access to other pods are restricted.

ahnn … I understand.

[root@yb4400.danskenet.net TEST:~]# curl https://pulp-server-prod/pulp/content/

Hum… so it is also happening with other Pulp components.
Do you know if there are other apps (not Pulp) running in this cluster with a similar issue? I am asking so we can rule out issues in configs like the MTU (if you have access to the ocp nodes, you could check whether, for example, the 100-byte overhead for the vxlan was taken into consideration).

In parallel, I am also checking on the load balancer side.

Nice! Maybe we can also check the communication from your host to Pulp, bypassing the routes:

  • repeat these tests with multiple api/content pods running on different nodes, taking note of where each pod is running in case of failure, so we can verify whether the error happens only on specific nodes
oc port-forward pod/<api pod> 24817:24817
curl localhost:24817/pulp/api/v3/status/

oc port-forward pod/<pulp content pod> 24816:24816
curl localhost:24816/pulp/content/
  • or pod to pod communication
oc exec <api pod> -- curl -s <content svc address>:24816/pulp/content/
oc exec <content pod> -- curl -s <api svc address>:24817/pulp/api/v3/status/

for example:
oc exec pulp-api-59fc794b5d-b4624 -- curl -s pulp-content-svc:24816/pulp/content/
oc exec pulp-content-c899df87d-7xp7s -- curl -s pulp-api-svc:24817/pulp/api/v3/status/
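If you can, it is also worth repeating the curl several times against the route and against the in-cluster service, and comparing where the resets show up (a minimal sketch; replace the placeholders with your route host and content service):

for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" https://<route host>/pulp/content/; done   #### a 000 here means the connection failed/was reset
oc exec <api pod> -- sh -c 'for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" http://<content svc address>:24816/pulp/content/; done'   #### same loop, but through the service, bypassing the routers

If the resets only happen through the route host, that would point to the ocp routers or the external LB rather than the pulp pods.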

Pod to pod communication is fine and it is displaying the output. Accessing the route load balancer directly is causing intermittent connectivity issues.