Application is not available error

Today, at around 5:36 PM UTC, Discourse was slow to respond, and eventually I got a 504 error when I visited discourse.pulpproject.org. After that, I got a page that said:

Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

When I tried to ping the address, it timed out as well:

$ ping discourse.pulpproject.org
PING test-discourse-pulpproject-org.apps.ospo-osci.z3b1.p1.openshiftapps.com (35.155.91.231) 56(84) bytes of data.
^C
--- test-discourse-pulpproject-org.apps.ospo-osci.z3b1.p1.openshiftapps.com ping statistics ---
41 packets transmitted, 0 received, 100% packet loss, time 40948ms

A look at the console did indeed show an error: the OpenShift node was marked as “non responsive”. I do not know why, but as far as I know, the pod will migrate to a new node if anything happens, and it should restart by itself. As we have access to neither the underlying cloud nor the lower part of the stack, I suggest waiting for now, in case that’s just a random problem (computers in the cloud also fail, and we do not have a full HA system: there is 1 application pod and that’s it, the rest is OpenShift magic).

I hit a problem just now (~17:10 UTC) trying to submit a post. First I got a 502 error and then a 504 error.

Looks like Discourse was down again for a few minutes. I got a 503 when trying to submit a post, and then when I went to discourse.pulpproject.org, I got an “application is not responding” error.

This happened a few times yesterday and @quba42 reported it happened this morning also.

I think that’s now fixed. The root cause is a bug in the OpenShift deployment. Kubernetes (and so OpenShift) has a concept of readiness/liveness probes: a command run at regular intervals to check that a process is not stuck. After the upgrade to 4.8, it seems something changed (or it had been broken forever), and the liveness probe for one of the processes didn’t work. So OpenShift retried a few times, then just killed and restarted the pod. Then it tried again, failed again, and so on.
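
For anyone who wants to picture that loop, here is a rough Python sketch of it. This is not the deployment code and not how the kubelet is implemented; `simulate_liveness` and its arguments are names made up for illustration. The only real value borrowed from Kubernetes is the default `failureThreshold` of 3 consecutive failures. Things like the initial delay and restart back-off are left out.

```python
from typing import Iterable, List


def simulate_liveness(probe_results: Iterable[bool],
                      failure_threshold: int = 3) -> List[int]:
    """Simulate a liveness probe (hypothetical helper, illustration only).

    probe_results holds one boolean per probe run: True if the probe
    command succeeded, False if it failed.
    Returns the indices of the probe runs that triggered a restart.
    """
    restarts: List[int] = []
    consecutive_failures = 0
    for tick, ok in enumerate(probe_results):
        if ok:
            # Any success resets the failure counter.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            # Too many failures in a row: the container is killed and
            # restarted, and the counter starts again from zero.
            restarts.append(tick)
            consecutive_failures = 0
    return restarts


if __name__ == "__main__":
    # A probe that never succeeds, even though the app itself is fine,
    # causes a restart every `failure_threshold` runs.
    print(simulate_liveness([False] * 9))
    # A healthy probe never triggers a restart.
    print(simulate_liveness([True] * 9))
```

With a probe that always fails, the restarts repeat forever, which matches the short outages people saw here: the application was fine, but the check on it was broken.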

For people with a passion for YAML, the patch is here: https://github.com/jontrossbach/openshift-discourse/pull/7/files

Sorry, I do not check Discourse that often. I noticed there were errors after the 4.8 upgrade, but Discourse was working when I checked, so I thought this was not a problem.
