action #132167

closed

asset uploading failed with http status 502 size:M

Added by tinita 10 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-06-29
Due date:
2023-08-30
% Done:

0%

Estimated time:

Description

Observation

This job http://openqa.opensuse.org/tests/3393876 failed in the uploading phase

[2023-06-29T18:11:42.386959+02:00] [error] Failed uploading asset
[2023-06-29T18:11:42.387207+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@64bit-2G-:TW.21297.qcow2: 502 response: Bad Gateway

At that time the webui was apparently stopped and restarted:

Jun 29 16:11:42 ariel systemd[1]: Stopping The openQA web UI...
Jun 29 16:11:42 ariel systemd[1]: openqa-webui.service: Deactivated successfully.
Jun 29 16:11:42 ariel systemd[1]: Stopped The openQA web UI.
Jun 29 16:13:12 ariel systemd[1]: Started The openQA web UI.

I would expect nginx to return a 503 in that case, and the asset uploading should retry.

From /var/log/zypp/history we can see that the service was restarted during package upgrade:

2023-06-29 16:11:33|command|root@ariel|'zypper' '-n' '--no-refresh' 'dup' '--auto-agree-with-licenses' '--download-in-advance'|
2023-06-29 16:11:34|install|openQA-common|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|1791b2e78db52c05fd9adc1ace2ac7b4451ec328d1558141796d90b8ec812c52|
2023-06-29 16:11:34|install|openQA-munin|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|b35e7151cd7b3b7f4224ae5244763267d29a20a48b5ea64a85ea397b13219bd6|
2023-06-29 16:11:36|install|openQA-continuous-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|910b7875523e37ae0db7c76cab74fe9d4113fcd2c07fe4a4578e2fd7a67c2d21|
2023-06-29 16:11:37|install|openQA-client|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|490b772224e5a64f354db398199ab23789fa7307d5edd8c561381a7dfe107e60|
2023-06-29 16:11:38|install|openQA-auto-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|6699637a855eccc6d5420c0c1129623da9c33096b25130ec9914963c9a526065|
2023-06-29 16:13:13|install|openQA|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|d137cfea77fc8567cfd7f9d5a5ccb372a54670da7e33ce5a3de293d53bb3c7f4|
2023-06-29 16:13:14|install|openQA-local-db|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|352062e0def659d7b349084300465b0161e5083b5a167a41ca49b312eb6cca63|

But 1.5 minutes is quite long. What caused that?

From /var/log/openqa.1.xz:

[2023-06-29T16:11:42.554141Z] [info] [pid:890] Worker 661 websocket connection closed - 1006
[2023-06-29T16:11:42.554307Z] [info] [pid:890] Worker 642 websocket connection closed - 1006
[2023-06-29T16:11:42.554567Z] [info] [pid:890] Worker 505 websocket connection closed - 1006
[2023-06-29T16:11:45.099867Z] [info] Listening at "http://127.0.0.1:9528" <-- liveview handler
[2023-06-29T16:11:45.100547Z] [info] Listening at "http://[::1]:9528"
(90s delay)
[2023-06-29T16:13:14.544271Z] [info] Listening at "http://127.0.0.1:9527" <-- websocket
[2023-06-29T16:13:14.544929Z] [info] Listening at "http://[::1]:9527"
[2023-06-29T16:13:14.918142Z] [info] Loading external plugin AMQP
[2023-06-29T16:13:14.973300Z] [info] Loading external plugin ObsRsync
[2023-06-29T16:13:15.527287Z] [info] Listening at "http://127.0.0.1:9526" <-- webUI
[2023-06-29T16:13:15.527842Z] [info] Listening at "http://[::1]:9526"
[2023-06-29T16:13:15.527939Z] [info] Manager 8148 started
[2023-06-29T16:13:15.537953Z] [info] Worker 8301 started

From /var/log/openqa_scheduler

[2023-06-29T16:11:42.131776Z] [debug] [pid:889] Scheduler took 0.77747s to perform operations and allocated 0 jobs
[2023-06-29T16:11:42.131955Z] [debug] [pid:889] [rescheduling] Current tick is at 0 ms. New tick will be in: 20000 ms
(90s delay)
[2023-06-29T16:13:14.675640Z] [info] Listening at "http://127.0.0.1:9529"
[2023-06-29T16:13:14.676281Z] [info] Listening at "http://[::1]:9529"

Where does the 90s delay come from?

Maybe we need to bump the timeouts within nginx to be a bit longer than what we had in Apache?
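For reference, a minimal sketch of the nginx proxy timeout directives that such a bump would involve; the location block and all values are only placeholders, not the current o3 configuration:

location / {
    proxy_pass http://webui;
    # illustrative values only, not the current settings
    proxy_connect_timeout 60s;  # time allowed to establish a connection to the upstream
    proxy_send_timeout    300s; # time allowed between two successive writes to the upstream
    proxy_read_timeout    300s; # time allowed between two successive reads from the upstream
}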

Acceptance criteria

  • AC1: Automatic package upgrades of openQA cause no confusing error reports to users
  • AC2: Automatic package upgrades of openQA do not cause incomplete openQA jobs

Suggestions

  • Apache returns 503 if the openQA webUI is down, so why does nginx return 502?
  • https://www.section.io/blog/504-503-errors-difference/
  • Ensure that openQA jobs do not incomplete in the upload phase when the webUI is absent for some seconds
  • Investigate what took the webUI 1.5 minutes to restart


Related issues (6: 0 open, 6 closed)

  • Related to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
  • Related to openQA Project - action #133496: [openqa-in-openQA] test fails due to not loaded install test modules (Resolved, okurz, 2023-07-28)
  • Related to openQA Infrastructure - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M (Resolved, dheidler, 2023-06-29)
  • Copied to openQA Project - action #133673: nginx on ariel logs in multiple locations size:M (Resolved, okurz)
  • Copied to openQA Project - action #133769: hydra.opensuse.org causing excessive load on o3 size:S (Resolved, livdywan, 2023-06-29 to 2023-08-04)
  • Copied to openQA Project - action #133772: Munin making many requests on o3 size:M (Resolved, tinita)
Actions #1

Updated by tinita 10 months ago

Also, hook scripts were failing a few times last night because of a 502 error, but I don't see a restart at that time:

Jun 29 23:22:21 ariel openqa-gru[29344]: 502 Bad Gateway
Jun 29 23:22:21 ariel openqa-gru[29344]: 
Jun 29 23:22:21 ariel openqa-gru[29343]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 111
Jun 29 23:22:21 ariel openqa-gru[29301]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 111
Jun 29 23:23:31 ariel openqa-gru[31734]: 502 Bad Gateway
Jun 29 23:23:32 ariel openqa-gru[31734]: 
Jun 29 23:23:32 ariel openqa-gru[31733]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 111
Jun 29 23:23:32 ariel openqa-gru[31703]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 111
Jun 30 00:01:07 ariel openqa-gru[884]: openqa-cli (70 /opt/os-autoinst-scripts/openqa-investigate): (openqa-cli api --header[...]
Jun 30 00:01:07 ariel openqa-gru[884]: openqa-cli (70 /opt/os-autoinst-scripts/openqa-investigate): (openqa-cli api --header[...]
Jun 30 00:01:07 ariel openqa-gru[884]: <head><title>502 Bad Gateway</title></head>
[...]
Jun 30 00:01:20 ariel openqa-gru[2192]: ERROR:/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs:Error while fetching http://openqa.opensuse.org/tests/3394769/investigation_ajax: 502 Server Error: Bad Gateway for url: http://openqa.opensuse.org/tests/3394769/inves[...]
Jun 30 00:01:20 ariel openqa-gru[2192]: Traceback (most recent call last):
[...]
Jun 30 00:01:20 ariel openqa-gru[2192]:   File "/usr/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
Jun 30 00:01:20 ariel openqa-gru[2192]:     raise HTTPError(http_error_msg, response=self)
Jun 30 00:01:20 ariel openqa-gru[2192]: requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: http://openqa.opensuse.org/tests/3394769/investigation_ajax
Jun 30 00:01:20 ariel openqa-gru[2209]: 502 Bad Gateway
Jun 30 00:02:10 ariel openqa-gru[2209]: 
Jun 30 00:02:10 ariel openqa-gru[3015]: openqa-clone-job (62 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output[...]
Jun 30 00:02:10 ariel openqa-gru[3015]: openqa-clone-job (62 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output[...]
Jun 30 00:02:11 ariel openqa-gru[2984]: openqa-cli (70 /opt/os-autoinst-scripts/openqa-investigate): (openqa-cli api --header[...]
Jun 30 00:02:11 ariel openqa-gru[2984]: openqa-cli (70 /opt/os-autoinst-scripts/openqa-investigate): (openqa-cli api --header[...]
Jun 30 00:02:58 ariel openqa-gru[3610]: openqa-cli (70 /opt/os-autoinst-scripts/openqa-investigate): (openqa-cli api --header[...]
Jun 30 00:02:58 ariel openqa-gru[3610]: openqa-cli (70 /opt/os-autoinst-scripts/openqa-investigate): (openqa-cli api --header[...]
Jun 30 00:02:58 ariel openqa-gru[3610]: <head><title>502 Bad Gateway</title></head>
Jun 30 00:02:58 ariel openqa-gru[3610]: <body>
Jun 30 00:02:58 ariel openqa-gru[3610]: <center><h1>502 Bad Gateway</h1></center>
Jun 30 00:02:58 ariel openqa-gru[3610]: <hr><center>nginx/1.21.5</center>
Jun 30 00:02:58 ariel openqa-gru[3610]: </body>
Jun 30 00:02:58 ariel openqa-gru[3610]: [11B blob data]
Actions #2

Updated by okurz 10 months ago

  • Related to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Actions #3

Updated by okurz 10 months ago

From /var/log/zypp/history we can see that the service was restarted during package upgrade:

2023-06-29 16:11:33|command|root@ariel|'zypper' '-n' '--no-refresh' 'dup' '--auto-agree-with-licenses' '--download-in-advance'|
2023-06-29 16:11:34|install|openQA-common|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|1791b2e78db52c05fd9adc1ace2ac7b4451ec328d1558141796d90b8ec812c52|
2023-06-29 16:11:34|install|openQA-munin|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|b35e7151cd7b3b7f4224ae5244763267d29a20a48b5ea64a85ea397b13219bd6|
2023-06-29 16:11:36|install|openQA-continuous-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|910b7875523e37ae0db7c76cab74fe9d4113fcd2c07fe4a4578e2fd7a67c2d21|
2023-06-29 16:11:37|install|openQA-client|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|490b772224e5a64f354db398199ab23789fa7307d5edd8c561381a7dfe107e60|
2023-06-29 16:11:38|install|openQA-auto-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|6699637a855eccc6d5420c0c1129623da9c33096b25130ec9914963c9a526065|
2023-06-29 16:13:13|install|openQA|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|d137cfea77fc8567cfd7f9d5a5ccb372a54670da7e33ce5a3de293d53bb3c7f4|
2023-06-29 16:13:14|install|openQA-local-db|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|352062e0def659d7b349084300465b0161e5083b5a167a41ca49b312eb6cca63|

But 1.5 minutes is quite long. What caused that?

Maybe we need to bump the timeouts within nginx to be a bit longer than what we had in Apache?

Actions #4

Updated by okurz 10 months ago

  • Subject changed from [openQA-in-openQA] asset uploading failed with http status 502 to [openQA-in-openQA] asset uploading failed with http status 502 size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to High
Actions #5

Updated by jbaier_cz 10 months ago

  • Assignee set to jbaier_cz
Actions #6

Updated by jbaier_cz 10 months ago

Another such instance:

Jul 14 08:50:39 ariel openqa-gru[8456]: 502 Bad Gateway
Jul 14 08:50:46 ariel openqa-gru[8456]:
Jul 14 08:50:46 ariel openqa-gru[8577]: ERROR:/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs:Error while fetching http://openqa.opensuse.org/api/v1/jobs/3428793: 502 Server Error: Bad Gateway for url: http://openqa.opensuse.org/api/v1/jobs/3428793
Jul 14 08:50:46 ariel openqa-gru[8577]: Traceback (most recent call last):
Jul 14 08:50:46 ariel openqa-gru[8577]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 302, in <module>
Jul 14 08:50:46 ariel openqa-gru[8577]:     main(parse_args())
Jul 14 08:50:46 ariel openqa-gru[8577]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 192, in main
Jul 14 08:50:46 ariel openqa-gru[8577]:     test_data = fetch_url(test_url, request_type="json")
Jul 14 08:50:46 ariel openqa-gru[8577]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 140, in fetch_url
Jul 14 08:50:46 ariel openqa-gru[8577]:     raise (e)
Jul 14 08:50:46 ariel openqa-gru[8577]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 137, in fetch_url
Jul 14 08:50:46 ariel openqa-gru[8577]:     content.raise_for_status()
Jul 14 08:50:46 ariel openqa-gru[8577]:   File "/usr/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
Jul 14 08:50:46 ariel openqa-gru[8577]:     raise HTTPError(http_error_msg, response=self)
Jul 14 08:50:46 ariel openqa-gru[8577]: requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: http://openqa.opensuse.org/api/v1/jobs/3428793

This was also caught by Zabbix (our web scenario). The interesting part is that the update did not actually update anything:

Jul 14 08:46:38 ariel openqa-continuous-update[2356]: All repositories have been refreshed.
Jul 14 08:46:39 ariel openqa-continuous-update[2559]: Loading repository data...
Jul 14 08:46:41 ariel openqa-continuous-update[2559]: Reading installed packages...
Jul 14 08:46:47 ariel openqa-continuous-update[2559]: Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Jul 14 08:46:47 ariel openqa-continuous-update[2559]: Computing distribution upgrade...
Jul 14 08:46:50 ariel openqa-continuous-update[2559]: Nothing to do.
Actions #7

Updated by jbaier_cz 10 months ago

From the access log

192.168.47.102 - - [14/Jul/2023:08:45:41 +0000] "GET /api/v1/jobs?ids=3428813 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
192.168.47.102 - - [14/Jul/2023:08:45:41 +0000] "GET /api/v1/jobs?ids=3428798 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
192.168.47.102 - - [14/Jul/2023:08:45:41 +0000] "GET /api/v1/jobs?ids=3428815 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
...
192.168.47.102 - - [14/Jul/2023:08:45:41 +0000] "GET /api/v1/jobs?ids=3428786 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
192.168.112.12 - - [14/Jul/2023:08:45:41 +0000] "POST /api/v1/jobs/3428396/artefact HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.031 urt="0.000"
...
192.168.47.102 - - [14/Jul/2023:08:46:56 +0000] "GET /api/v1/jobs?ids=3428767 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
192.168.47.102 - - [14/Jul/2023:08:46:56 +0000] "GET /api/v1/jobs?ids=3428747 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
...
192.168.112.18 - - [14/Jul/2023:08:46:56 +0000] "POST /api/v1/jobs/3428536/artefact HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
192.168.112.18 - - [14/Jul/2023:08:46:56 +0000] "POST /api/v1/jobs/3428381/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)" rt=0.062 urt="0.061"
192.168.47.102 - - [14/Jul/2023:08:46:56 +0000] "GET /api/v1/jobs?ids=3428804 HTTP/1.1" 200 1518 "-" "Python/3.11 aiohttp/3.8.4" rt=0.064 urt="0.065"
192.168.47.102 - - [14/Jul/2023:08:46:56 +0000] "GET /api/v1/jobs?ids=3428763 HTTP/1.1" 200 970 "-" "Python/3.11 aiohttp/3.8.4" rt=0.064 urt="0.065"
192.168.47.102 - - [14/Jul/2023:08:46:57 +0000] "GET /api/v1/jobs?ids=3428737 HTTP/1.1" 502 157 "-" "Python/3.11 aiohttp/3.8.4" rt=0.000 urt="0.000"
...

From the error_log, there is a corresponding error, which is hard to spot among many other similar lines:

2023/07/14 08:46:58 [error] 16227#16227: *18664650 no live upstreams while connecting to upstream, client: 192.168.47.102, server: openqa.opensuse.org, request: "GET /api/v1/jobs?ids=3426432 HTTP/1.1", upstream: "http://webui/api/v1/jobs?ids=3426432", host: "openqa.opensuse.org"

The upstream setting is:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526 max_conns=30;
}

Is the 30 connection limit enough? It corresponds to the -w 30 in exec "$(dirname "$0")"/openqa prefork -m production --proxy -i 100 -H 900 -w 30 -c 1 -G 800 "$@" used to start the webUI.

Do we have enough processes to handle the traffic?

Actions #8

Updated by jbaier_cz 10 months ago

  • Status changed from Workable to In Progress
Actions #9

Updated by jbaier_cz 10 months ago

/var/log/openqa* show nothing interesting around that time

Actions #10

Updated by okurz 10 months ago

I also found another occurrence of 502 right now in
https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5542118585/jobs/10152542083

Job state of job ID 3431318: running, waiting …
{"blocked_by_id":null,"id":3431318,"result":"none","state":"running"}
Job state of job ID 3431318: running, waiting …
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.21.5</center>
</body>
</html>
Actions #11

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by jbaier_cz 10 months ago

I suspected the tooling, but we already treat 502 and 503 the same in our scripts; the retry in openqa-cli was added years ago. Nginx returns 502 because it acts as a proxy; in the old setup, Apache itself was the server responsible for handling the requests.

However, it seems to me that both AC1 and AC2 are not at all related to this problem; the root cause is simply too many requests for our server to handle, and the solution might be to increase the connection limit and the number of worker processes accordingly. The current magic number is 30, as can be seen in the configuration:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526 max_conns=30;
}
exec "$(dirname "$0")"/openqa prefork -m production --proxy -i 100 -H 900 -w 30 -c 1 -G 800 "$@"
Actions #13

Updated by kraih 10 months ago

Yeah, all the "no live upstreams while connecting to upstream" errors in the log from the past few days (at different times, not just during upgrades) do suggest that the webUI is overloaded and needs more capacity.

Actions #14

Updated by okurz 10 months ago

kraih wrote:

Yeah, all the "no live upstreams while connecting to upstream" errors in the log from the past few days (at different times, not just during upgrades) do suggest that the webUI is overloaded and needs more capacity.

Providing more capacity to the webUI is a good idea but should not be the concern of this ticket. And it should be expected that the openQA webUI is unresponsive, whether due to package upgrades or being overloaded; that does not really matter. So this is what we should look into, e.g. make sure that the worker retries the connection.

Actions #15

Updated by kraih 10 months ago

Our upstream config for the webUI might also be a bit too strict; I think the default timeout for getting an upstream connection is only 10 seconds with 1 attempt. Something like this might already help:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526 max_conns=30 max_fails=3 fail_timeout=30s;
}

Edit: Trying max_fails=5 fail_timeout=30s on O3 for a bit.

Update: This did not work.

Actions #16

Updated by kraih 10 months ago

okurz wrote:

And it should be expected that the openQA webUI is unresponsive, whether due to package upgrades or being overloaded; that does not really matter. So this is what we should look into, e.g. make sure that the worker retries the connection.

That's fair. It should be the first priority. But given how fresh our nginx setup still is, we also shouldn't overlook possible improvements there. I wouldn't be surprised if some timeout tweaking could avoid many reconnects in the first place.

Actions #17

Updated by okurz 10 months ago

Sure, I am all for it. I actually didn't mean that we would need to make changes in the worker code. The worker already has some means of retrying and that has worked reliably for a long time. So I also suspect that nginx config changes could make the worker retry better or something.

Actions #18

Updated by jbaier_cz 10 months ago

kraih wrote:

Trying max_fails=3 fail_timeout=30s on O3 for a bit.

Good idea, thanks.

okurz wrote:

e.g. make sure that the worker retries the connection

As far as I was able to find out, we already do that. However in this case the retries will overload the server as well. Tweaking the timeout (or any other configuration options) is probably the best approach right now.

Actions #19

Updated by kraih 10 months ago

I've just seen a whole bunch of new errors at a time when no maintenance should have happened:

2023/07/18 09:02:08 [error] 11900#11900: *12836469 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437429/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437429/artefact", host: "openqa1-opensuse"
2023/07/18 09:02:08 [error] 11901#11901: *12836562 no live upstreams while connecting to upstream, client: 192.168.112.18, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437649/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437649/artefact", host: "openqa1-opensuse"
2023/07/18 09:02:08 [error] 11901#11901: *12836568 no live upstreams while connecting to upstream, client: 192.168.47.102, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437059/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437059/artefact", host: "openqa.opensuse.org"
2023/07/18 09:02:08 [error] 11902#11902: *12836565 no live upstreams while connecting to upstream, client: 192.168.112.18, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437618/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437618/artefact", host: "openqa1-opensuse"
2023/07/18 09:02:13 [error] 11901#11901: *12737480 no live upstreams while connecting to upstream, client: 192.168.112.3, server: openqa.opensuse.org, request: "GET /api/v1/mm/children HTTP/1.1", upstream: "http://webui/api/v1/mm/children", host: "openqa1-opensuse"
2023/07/18 09:02:13 [error] 11901#11901: *12837653 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437670/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437670/artefact", host: "openqa1-opensuse"
2023/07/18 09:02:18 [error] 11901#11901: *12838969 no live upstreams while connecting to upstream, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437612/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437612/artefact", host: "openqa1-opensuse"
2023/07/18 09:02:19 [error] 11902#11902: *12839159 no live upstreams while connecting to upstream, client: 192.168.112.7, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437236/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437236/artefact", host: "openqa1-opensuse"
2023/07/18 09:02:19 [error] 11902#11902: *12839155 no live upstreams while connecting to upstream, client: 192.168.47.102, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3437058/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3437058/artefact", host: "openqa.opensuse.org"

CPU load is almost at 100% and all webui workers seem active. So it looks like O3 is currently overloaded.

Actions #20

Updated by okurz 10 months ago

kraih wrote:

CPU load is almost at 100% and all webui workers seem active. So it looks like O3 is currently overloaded.

That nicely confirms our suspicion and supports what we said we should do to improve.

Actions #21

Updated by kraih 10 months ago

One more thing I want to try is reducing the number of upstream connections, maybe to max_conns=25:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526 max_conns=25 max_fails=3 fail_timeout=30s;
}

This might seem a bit counterintuitive, but the rationale is that perhaps we are running too close to the capacity limits of the webUI workers, and nginx will happily overwhelm upstreams if we let it.

Update: This did not work.

Actions #22

Updated by kraih 10 months ago

kraih wrote:

One more thing I want to try is reducing the number of upstream connections, maybe to max_conns=25...

Unfortunately this appears to have done nothing over the past 24 hours. Still a large number of upstream errors:

2023/07/19 09:47:10 [error] 21195#21195: *22660668 no live upstreams while connecting to upstream, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440331/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440331/artefact", host: "openqa1-opensuse"
2023/07/19 09:47:14 [error] 21195#21195: *22662258 no live upstreams while connecting to upstream, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440230/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440230/artefact", host: "openqa1-opensuse"
2023/07/19 09:47:18 [error] 21195#21195: *22663367 no live upstreams while connecting to upstream, client: 192.168.47.102, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3439087/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3439087/artefact", host: "openqa.opensuse.org"
2023/07/19 09:47:25 [error] 21195#21195: *22665270 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440218/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440218/status", host: "openqa1-opensuse"
2023/07/19 09:47:27 [error] 21195#21195: *22666035 no live upstreams while connecting to upstream, client: 192.168.112.18, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440217/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440217/artefact", host: "openqa1-opensuse"
2023/07/19 09:47:28 [error] 21195#21195: *22666365 no live upstreams while connecting to upstream, client: 192.168.112.12, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440193/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440193/artefact", host: "openqa1-opensuse"
2023/07/19 09:47:49 [error] 21195#21195: *22671471 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440354/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440354/status", host: "openqa1-opensuse"
2023/07/19 09:47:49 [error] 21195#21195: *22671586 no live upstreams while connecting to upstream, client: 192.168.112.7, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440233/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440233/artefact", host: "openqa1-opensuse"
2023/07/19 09:48:20 [error] 21194#21194: *22679375 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440167/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440167/artefact", host: "openqa1-opensuse"
2023/07/19 09:48:20 [error] 21195#21195: *22679390 no live upstreams while connecting to upstream, client: 192.168.112.12, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440151/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440151/status", host: "openqa1-opensuse"
2023/07/19 09:48:33 [error] 21195#21195: *22682087 no live upstreams while connecting to upstream, client: 192.168.112.12, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440151/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440151/artefact", host: "openqa1-opensuse"
2023/07/19 09:48:33 [error] 21195#21195: *22682094 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440309/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440309/status", host: "openqa1-opensuse"
2023/07/19 09:48:39 [error] 21192#21192: *22683757 no live upstreams while connecting to upstream, client: 192.168.112.23, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3440149/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3440149/artefact", host: "openqa1-opensuse"
Actions #23

Updated by jbaier_cz 10 months ago

  • Status changed from In Progress to Blocked

I think we should block this on #132143 for now and make new changes after the move is done.

Actions #24

Updated by livdywan 9 months ago

  • Status changed from Blocked to Feedback

Asset upload seems to have been fixed (see https://openqa.opensuse.org/group_overview/24), so we don't need to block on the move.

Actions #25

Updated by livdywan 9 months ago

  • Due date changed from 2023-07-29 to 2023-08-04

Further points to investigate:

  • We no longer have the proxy
  • Does nginx NOT retry when it exceeds the 30 connections (unlike Apache)?
  • We did have 502 errors BEFORE the migration
  • Issues from Apache times may just be more visible with nginx?
  • We can try rate limits

It seems unclear right now, and the result of a quick poll on the call was to leave the ticket in feedback for a while to monitor it, so I'm also bumping the due date.

Actions #26

Updated by okurz 9 months ago

  • Assignee changed from jbaier_cz to livdywan

assigning to @livdywan to lead further efforts as decided in weekly

Actions #27

Updated by okurz 9 months ago

  • Status changed from Feedback to Workable
  • Priority changed from High to Urgent

https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5719915990/job/15498813797?pr=140#step:5:148 is another occurrence when using the openQA client. I am convinced that the proxy we had before, i.e. Apache instead of nginx, was more forgiving.

If you want to have this solved before the due date expires, this becomes urgent now.

Actions #28

Updated by livdywan 9 months ago

  • We can try rate limits
limit_req_zone $request_uri zone=by_uri:10m rate=1000r/m;
[...]
limit_req zone=by_uri burst=5 nodelay;

Something like this?
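Roughly sketched, and only as an assumption about placement: the limit_req_zone definition would go into the http context and limit_req into the server or location block that proxies to the webUI, e.g.:

http {
    # shared memory zone keyed by request URI, 1000 requests per minute (values from the snippet above)
    limit_req_zone $request_uri zone=by_uri:10m rate=1000r/m;

    server {
        location / {
            # allow short bursts of up to 5 requests, reject further excess immediately
            limit_req zone=by_uri burst=5 nodelay;
            proxy_pass http://webui;
        }
    }
}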

Actions #29

Updated by livdywan 9 months ago

  • Subject changed from [openQA-in-openQA] asset uploading failed with http status 502 size:M to asset uploading failed with http status 502 size:M

Dropping the test name from the title; this is not a test issue per se.

Actions #30

Updated by livdywan 9 months ago

  • Copied to action #133673: nginx on ariel logs in multiple locations size:M added
Actions #31

Updated by tinita 9 months ago

We just looked at a big batch of 502s from yesterday.

From openqa.error_log:

2023/08/01 16:13:19 [error] 32576#32576: *13674918 upstream prematurely closed connection while reading response header from upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473702/artefact HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473702/artefact", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32577#32577: *13674926 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473686/status HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473686/status", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32576#32576: *13674925 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473605/status HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473605/status", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32576#32576: *13391926 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/barrier/test_t09_openvpn_tunnel_ifdown_post_run?check_dead_job=1 HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/barrier/test_t09_openvpn_tunnel_ifdown_post_run?check_dead_job=1", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32576#32576: *13674923 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 217.140.109.10, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473593/artefact HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473593/artefact", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32576#32576: *13674932 connect() failed (111: Connection refused) while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473707/status HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473707/status", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32577#32577: *13674929 connect() failed (111: Connection refused) while connecting to upstream, client: 217.140.109.10, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473595/artefact HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473595/artefact", host: "openqa.opensuse.org"
2023/08/01 16:13:19 [error] 32576#32576: *13674935 connect() failed (111: Connection refused) while connecting to upstream, client: 10.150.1.24, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473697/status HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473697/status", host: "openqa.opensuse.org"

...
2023/08/01 16:13:26 [error] 32576#32576: *13675033 connect() failed (111: Connection refused) while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3473630/status HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3473630/status", host: "openqa.opensuse.org"
2023/08/01 16:13:26 [error] 32576#32576: *13675035 connect() failed (111: Connection refused) while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3472843/status HTTP/1.1", upstream: "http://127.0.0.1:9526/api/v1/jobs/3472843/status", host: "openqa.opensuse.org"

2023/08/01 16:13:30 [error] 32578#32578: *13675130 no live upstreams while connecting to upstream, client: 10.150.1.26, server: openqa.opensuse.org, request: "POST /api/v1/workers?cpu_arch=x86_64&cpu_flags=fpu+vme+de+pse+tsc+msr+pae+mce+cx8+apic+sep+mtrr+pge+mca+cmov+pat+pse36+clflush+mmx+fxsr+sse+sse2+ht+syscall+nx+mmxext+fxsr_opt+pdpe1gb+rdtscp+lm+constant_tsc+rep_good+nopl+nonstop_tsc+cpuid+extd_apicid+aperfmperf+rapl+pni+pclmulqdq+monitor+ssse3+fma+cx16+pcid+sse4_1+sse4_2+movbe+popcnt+aes+xsave+avx+f16c+rdrand+lahf_lm+cmp_legacy+svm+extapic+cr8_legacy+abm+sse4a+misalignsse+3dnowprefetch+osvw+ibs+skinit+wdt+tce+topoext+perfctr_core+perfctr_nb+bpext+perfctr_llc+mwaitx+cpb+cat_l3+cdp_l3+invpcid_single+hw_pstate+ssbd+mba+ibrs+ibpb+stibp+vmmcall+fsgsbase+bmi1+avx2+smep+bmi2+invpcid+cqm+rdt_a+rdseed+adx+smap+clflushopt+clwb+sha_ni+xsaveopt+xsavec+xgetbv1+xsaves+cqm_llc+cqm_occup_llc+cqm_mbm_total+cqm_mbm_local+clzero+irperf+xsaveerptr+rdpru+wbnoinvd+amd_ppin+brs+arat+npt+lbrv+svm_lock+nrip_save+tsc_scale+vmcb_clean+flushbyasid+decodeassists+pausefilter+pfthreshold+v_vmsave_vmload+vgif+v_spec_ctrl+umip+pku+ospke+vaes+vpclmulqdq+rdpid+overflow_recov+succor+smca&cpu_modelname=AMD+EPYC+7773X+64-Core+Processor&cpu_opmode=32-bit%2C+64-bit&host=openqaworker23&instance=38&isotovideo_interface_version=39&mem_max=515376&websocket_api_version=1&worker_class=openqaworker23%2Cx86_64-v3%2Cqemu_x86_64_poo132134%2Cqemu_i586_poo132134%2Cheavyload_poo132134%2Cvagrant_poo132134%2Ctap%2Cwsl2_poo132134%2Cprg%2Cprg2%2Cprg2-j12 HTTP/1.1", upstream: "http://webui/api/v1/workers?cpu_arch=x86_64&cpu_flags=fpu+vme+de+pse+tsc+msr+pae+mce+cx8+apic+sep+mtrr+pge+mca+cmov+pat+pse36+clflush+mmx+fxsr+sse+sse2+ht+syscall+nx+mmxext+fxsr_opt+pdpe1gb+rdtscp+lm+constant_tsc+rep_good+nopl+nonstop_tsc+cpuid+extd_apicid+aperfmperf+rapl+pni+pclmulqdq+monitor+ssse3+fma+cx16+pcid+sse4_1+sse4_2+movbe+popcnt+aes+xsave+avx+f16c+rdrand+lahf_lm+cmp_legacy+svm+extapic+cr8_legacy+abm+sse4a+misalignsse+3dnowprefetch+osvw+ibs+skinit+wdt+tce+topoext+perfctr_core+perfc
2023/08/01 16:13:30 [error] 32578#32578: *13675140 no live upstreams while connecting to upstream, client: 10.150.1.24, server: openqa.opensuse.org, request: "POST /api/v1/workers?cpu_arch=x86_64&cpu_flags=fpu+vme+de+pse+tsc+msr+pae+mce+cx8+apic+sep+mtrr+pge+mca+cmov+pat+pse36+clflush+mmx+fxsr+sse+sse2+ht+syscall+nx+mmxext+fxsr_opt+pdpe1gb+rdtscp+lm+constant_tsc+rep_good+nopl+nonstop_tsc+cpuid+extd_apicid+aperfmperf+rapl+pni+pclmulqdq+monitor+ssse3+fma+cx16+pcid+sse4_1+sse4_2+movbe+popcnt+aes+xsave+avx+f16c+rdrand+lahf_lm+cmp_legacy+svm+extapic+cr8_legacy+abm+sse4a+misalignsse+3dnowprefetch+osvw+ibs+skinit+wdt+tce+topoext+perfctr_core+perfctr_nb+bpext+perfctr_llc+mwaitx+cpb+cat_l3+cdp_l3+invpcid_single+hw_pstate+ssbd+mba+ibrs+ibpb+stibp+vmmcall+fsgsbase+bmi1+avx2+smep+bmi2+invpcid+cqm+rdt_a+rdseed+adx+smap+clflushopt+clwb+sha_ni+xsaveopt+xsavec+xgetbv1+xsaves+cqm_llc+cqm_occup_llc+cqm_mbm_total+cqm_mbm_local+clzero+irperf+xsaveerptr+rdpru+wbnoinvd+amd_ppin+brs+arat+npt+lbrv+svm_lock+nrip_save+tsc_scale+vmcb_clean+flushbyasid+decodeassists+pausefilter+pfthreshold+v_vmsave_vmload+vgif+v_spec_ctrl+umip+pku+ospke+vaes+vpclmulqdq+rdpid+overflow_recov+succor+smca&cpu_modelname=AMD+EPYC+7773X+64-Core+Processor&cpu_opmode=32-bit%2C+64-bit&host=openqaworker21&instance=25&isotovideo_interface_version=39&mem_max=515586&websocket_api_version=1&worker_class=openqaworker21%2Cx86_64-v3%2Cqemu_x86_64%2Cqemu_i586%2Cheavyload%2Cvagrant%2Ctap_poo132134%2Cwsl2%2Cprg%2Cprg2%2Cprg2-j12 HTTP/1.1", upstream: "http://webui/api/v1/workers?cpu_arch=x86_64&cpu_flags=fpu+vme+de+pse+tsc+msr+pae+mce+cx8+apic+sep+mtrr+pge+mca+cmov+pat+pse36+clflush+mmx+fxsr+sse+sse2+ht+syscall+nx+mmxext+fxsr_opt+pdpe1gb+rdtscp+lm+constant_tsc+rep_good+nopl+nonstop_tsc+cpuid+extd_apicid+aperfmperf+rapl+pni+pclmulqdq+monitor+ssse3+fma+cx16+pcid+sse4_1+sse4_2+movbe+popcnt+aes+xsave+avx+f16c+rdrand+lahf_lm+cmp_legacy+svm+extapic+cr8_legacy+abm+sse4a+misalignsse+3dnowprefetch+osvw+ibs+skinit+wdt+tce+topoext+perfctr_core+perfctr_nb+bpext+perfctr_llc+mwaitx+cpb+cat_l
...
2023/08/01 16:14:00 [error] 32579#32579: *13677319 no live upstreams while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/workers?cpu_arch=x86_64&cpu_flags=fpu+vme+de+pse+tsc+msr+pae+mce+cx8+apic+sep+mtrr+pge+mca+cmov+pat+pse36+clflush+mmx+fxsr+sse+sse2+ht+syscall+nx+mmxext+fxsr_opt+pdpe1gb+rdtscp+lm+constant_tsc+rep_good+nopl+nonstop_tsc+cpuid+extd_apicid+aperfmperf+rapl+pni+pclmulqdq+monitor+ssse3+fma+cx16+pcid+sse4_1+sse4_2+movbe+popcnt+aes+xsave+avx+f16c+rdrand+lahf_lm+cmp_legacy+svm+extapic+cr8_legacy+abm+sse4a+misalignsse+3dnowprefetch+osvw+ibs+skinit+wdt+tce+topoext+perfctr_core+perfctr_nb+bpext+perfctr_llc+mwaitx+cpb+cat_l3+cdp_l3+invpcid_single+hw_pstate+ssbd+mba+ibrs+ibpb+stibp+vmmcall+fsgsbase+bmi1+avx2+smep+bmi2+invpcid+cqm+rdt_a+rdseed+adx+smap+clflushopt+clwb+sha_ni+xsaveopt+xsavec+xgetbv1+xsaves+cqm_llc+cqm_occup_llc+cqm_mbm_total+cqm_mbm_local+clzero+irperf+xsaveerptr+rdpru+wbnoinvd+amd_ppin+arat+npt+lbrv+svm_lock+nrip_save+tsc_scale+vmcb_clean+flushbyasid+decodeassists+pausefilter+pfthreshold+v_vmsave_vmload+vgif+v_spec_ctrl+umip+pku+ospke+vaes+vpclmulqdq+rdpid+overflow_recov+succor+smca&cpu_modelname=AMD+EPYC+7543+32-Core+Processor&cpu_opmode=32-bit%2C+64-bit&host=openqaworker20&instance=7&isotovideo_interface_version=39&mem_max=515606&websocket_api_version=1&worker_class=openqaworker20%2Cx86_64-v3%2Cqemu_x86_64%2Cqemu_i586%2Cheavyload%2Cvagrant%2Cwsl2%2Cnue%2Cmaxtorhof%2Cnue1%2Cnue1-srv1 HTTP/1.1", upstream: "http://webui/api/v1/workers?cpu_arch=x86_64&cpu_flags=fpu+vme+de+pse+tsc+msr+pae+mce+cx8+apic+sep+mtrr+pge+mca+cmov+pat+pse36+clflush+mmx+fxsr+sse+sse2+ht+syscall+nx+mmxext+fxsr_opt+pdpe1gb+rdtscp+lm+constant_tsc+rep_good+nopl+nonstop_tsc+cpuid+extd_apicid+aperfmperf+rapl+pni+pclmulqdq+monitor+ssse3+fma+cx16+pcid+sse4_1+sse4_2+movbe+popcnt+aes+xsave+avx+f16c+rdrand+lahf_lm+cmp_legacy+svm+extapic+cr8_legacy+abm+sse4a+misalignsse+3dnowprefetch+osvw+ibs+skinit+wdt+tce+topoext+perfctr_core+perfctr_nb+bpext+perfctr_llc+mwaitx+cpb+cat_l3+cdp

Edit: actually this is from an openqa-webui restart, so that's expected.

Actions #32

Updated by livdywan 9 months ago

maybe openqaworker23 was flooded with requests as part of #133496 since it shows up a lot in these logs

Actions #33

Updated by livdywan 9 months ago

  • Related to action #133496: [openqa-in-openQA] test fails due to not loaded install test modules added
Actions #34

Updated by livdywan 9 months ago

livdywan wrote:

maybe openqaworker23 was flooded with requests as part of #133496 since it shows up a lot in these logs

Correction, looks like it must've been #132134 since we can see that ticket in the worker class

Actions #35

Updated by livdywan 9 months ago

  • Related to action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Actions #36

Updated by livdywan 9 months ago

We ultimately decided against rate limits based on the instances we investigated. Most likely the floods of "Updating seen of worker .*" messages from worker_status are legit... I'll take a look at this code path anyway since optimizing it would be very beneficial and it's our best action item for now.

Actions #37

Updated by kraih 9 months ago

Trying to disable the accounting of failed attempts on O3 now with max_fails=0; since we only have one upstream, this could prevent 502 responses in a few cases:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526 max_conns=30 max_fails=0;
}

Update: This also did not work, which leaves the other extreme, max_fails=50 fail_timeout=30s, to be tested.

Actions #38

Updated by okurz 9 months ago

What I see missing is ensuring that nginx waits even longer, the same as we expect Apache did.

Actions #39

Updated by livdywan 9 months ago

Things to confirm before we decide to bump the due date and wait for more feedback vs. coming up with more concrete action points. My comment with the notes from the call got lost so I'm trying to replicate it now:

  • nginx seems to blocklist the host if any worker instance is busy. Maybe we can use different ports?
  • rate limits are pointless because nginx already does what we want
  • worker_status shows up loads in logs, but is probably working as designed
  • webui restarts can cause surges of 502's

https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5719915990/job/15498813797?pr=140#step:5:148 is another occurrence when using the openQA client. I am convinced that the proxy we had before, i.e. Apache instead of nginx, was more forgiving.

  • Did we have a conclusion on this case?

Commands used to search the logs:

sudo grep error /var/log/nginx/error.log
tail /space/logs/nginx/openqa.error_log
sudo xzgrep error /var/log/openqa.2.xz
Actions #40

Updated by livdywan 9 months ago

  • Copied to action #133769: hydra.opensuse.org causing excessive load on o3 size:S added
Actions #41

Updated by livdywan 9 months ago

  • Copied to action #133772: Munin making many requests on o3 size:M added
Actions #42

Updated by livdywan 9 months ago

More ideas

  • Try max_fails=50 fail_timeout=30s (is now in production)
  • Prepare a mini mojo app to reproduce broken worker instances (if super high max_fails value does not work)
  • Can we get the number of req's from nginx logs?
    • We can't get them from Munin since it only samples every 5 minutes
    • Maybe goaccess can help us
    • See https://rt.goaccess.io/?20230331192652 for the live demo
    • Run goaccess /space/logs/nginx/access.log on o3 and select NCSA Combined Log Format
    • "o" to expand
    • "s" to sort
    • Zypper repo data and rpm's
    • What is hydra.opensuse.org and why is it showing up
    • Many more requests of /server-status?auto from Munin even though it should be every 5 minutes

Btw logs are now in /space/logs/nginx/access.log (#133673)

Actions #43

Updated by livdywan 9 months ago

  • Status changed from Workable to Feedback
Actions #44

Updated by kraih 9 months ago

Some good news, I think with max_fails=50 fail_timeout=30s we are on the right track. Since yesterday we've only seen one case of "no live upstreams", which is a lot less than usual:

2023/08/03 15:17:57 [error] 8257#8257: *349749 no live upstreams while connecting to upstream, client: 10.150.1.25, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3479161/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3479161/status", host: "openqa.opensuse.org"
2023/08/03 15:17:57 [error] 8257#8257: *349754 no live upstreams while connecting to upstream, client: 10.150.1.24, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3479552/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3479552/status", host: "openqa.opensuse.org"
2023/08/03 15:17:57 [error] 8257#8257: *349757 no live upstreams while connecting to upstream, client: 10.150.1.24, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3478979/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3478979/status", host: "openqa.opensuse.org"
2023/08/03 15:17:57 [error] 8257#8257: *349765 no live upstreams while connecting to upstream, client: 10.150.1.25, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3479610/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3479610/status", host: "openqa.opensuse.org"

Let's see if doubling max_fails again makes it go away completely over the weekend.

Actions #45

Updated by livdywan 9 months ago

  • Due date changed from 2023-08-04 to 2023-08-11

Let's give it a few more days with the above.

Actions #46

Updated by kraih 9 months ago

Over the weekend there have been two larger batches of "no live upstream" errors that were not caused by service restarts:

2023/08/05 21:52:49 [error] 13199#13199: *16061757 no live upstreams while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3485746/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3485746/artefact", host: "openqa.opensuse.org"
2023/08/05 21:52:51 [error] 13202#13202: *16062791 no live upstreams while connecting to upstream, client: 10.150.1.27, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3485258/artefact HTTP/1.1", upstream: "http://webui/api/v1/jobs/3485258/artefact", host: "openqa.opensuse.org"
...
2023/08/07 02:09:45 [error] 2766#2766: *2606291 no live upstreams while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3487305/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3487305/status", host: "openqa.opensuse.org"
2023/08/07 02:09:45 [error] 2766#2766: *2606295 no live upstreams while connecting to upstream, client: 195.135.221.150, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3487448/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3487448/status", host: "openqa.opensuse.org"
...

Both seem to be fairly close to spikes of 240+ requests per second in the Munin data for nginx. That means even with a large max_fails=1000 fail_timeout=30s value it can't catch up, but "no live upstreams" errors do happen much less frequently. However, I noticed another interesting data point in the corresponding 502 response entries in the access log:

195.135.221.150 - - [07/Aug/2023:02:09:45 +0000] "POST /api/v1/jobs/3487305/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
195.135.221.150 - - [07/Aug/2023:02:09:45 +0000] "POST /api/v1/jobs/3487448/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
195.135.221.150 - - [07/Aug/2023:02:09:45 +0000] "POST /api/v1/jobs/3487111/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
195.135.221.150 - - [07/Aug/2023:02:09:45 +0000] "POST /api/v1/jobs/3487492/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
195.135.221.150 - - [07/Aug/2023:02:09:45 +0000] "POST /api/v1/jobs/3487237/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
195.135.221.150 - - [07/Aug/2023:02:09:45 +0000] "POST /api/v1/jobs/3487233/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"

Those rt and urt values mean that there hasn't even been an attempt at waiting for an upstream connection to become available. So figuring out how to get nginx to wait in this case will require some experimentation.
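For context, the rt and urt fields presumably come from a custom log_format along these lines (the format name is made up; rt maps to nginx's $request_time and urt to $upstream_response_time):

log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                 'rt=$request_time urt="$upstream_response_time"';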

Actions #47

Updated by kraih 9 months ago

I'll run local nginx experiments this week to figure out which timeouts we really need to change.

Update: It was pretty easy to replicate the issue locally, and I already learned that max_fails and fail_timeout do absolutely nothing here and were just a red herring. Nginx disables both parameters if there is only one server.
Update2: In theory max_conns is exactly what we want for protecting the app server, but there appears to be no way to prevent excess requests from immediately receiving a 5xx response if there are no more free connections in the free version of Nginx. The queue feature that is currently only available with a commercial subscription seems to be required to actually make Nginx wait for upstream connections to become available. That means our best option forward is probably to remove max_conns and rate limit in other ways.
Update3: I've removed max_conns on O3 completely for now to collect new data for what it would look like without the brute force rate limiting after all the recent optimisations. Check error messages after 2023/08/08 14:54:29.
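For reference, the queue feature mentioned above would look roughly like this, but the queue directive is only available in the commercial NGINX Plus, so it is not an option here:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526 max_conns=30;
    # NGINX Plus only: hold up to 100 excess requests for 30s instead of failing them immediately
    queue 100 timeout=30s;
}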

Actions #48

Updated by livdywan 9 months ago

  • Assignee changed from livdywan to kraih

Maybe at this point you should take the ticket. I can't claim to be doing all of this ;-)

Actions #49

Updated by kraih 9 months ago

Since max_conns has been removed we've not seen any "no live upstreams" errors that were not related to an openqa-webui service restart. I will make a patch for the default Nginx config to remove max_conns there as well. Then take a look at the worker upload code to see if we need to add proper retry for 502 in addition to 503.
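Presumably the resulting upstream block on O3 now simply looks like this, without any connection cap:

upstream webui {
    zone upstream_webui 64k;
    server [::1]:9526;
}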

Actions #50

Updated by kraih 9 months ago

Apache returns 503 if the openQA webUI is down, so why does nginx return 502?

To answer this question from the suggestions section: this is simply how nginx works, it considers no live upstreams being available a 502 Bad Gateway error. And that will happen when openqa-webui gets restarted, so we need to deal with it to support nginx.

Actions #51

Updated by kraih 9 months ago

  • Status changed from Feedback to In Progress
Actions #52

Updated by kraih 9 months ago

Actions #53

Updated by kraih 9 months ago

  • Due date changed from 2023-08-11 to 2023-08-18

Bumping the due date since i still need to look at the retry code.

Actions #54

Updated by kraih 9 months ago

PR to make the upload retry log messages a bit more detailed: https://github.com/os-autoinst/openQA/pull/5280

Actions #55

Updated by kraih 9 months ago

Ensure that openQA jobs do not incomplete in the upload phase when the webUI is absent for some seconds

The asset upload code is configured with 5 attempts and 5 second sleeps in between attempts for each chunk.

Investigate what took the webUI 1.5m to restart

I cannot answer this; looking through the logs, it seems the downtimes have been pretty consistent at only a few seconds for normal updates and about 30 seconds when a full reboot was required.

Jul 26 22:22:36 new-ariel systemd[1]: Stopping The openQA web UI...
Jul 26 22:22:37 new-ariel systemd[1]: openqa-webui.service: Deactivated successfully.
Jul 26 22:22:37 new-ariel systemd[1]: Stopped The openQA web UI.
Jul 26 22:22:37 new-ariel systemd[1]: Started The openQA web UI.
Jul 26 22:22:39 new-ariel openqa-webui-daemon[31984]: Web application available at http://127.0.0.1:9526
Jul 29 03:00:17 new-ariel systemd[1]: Stopping The openQA web UI...
Jul 29 03:00:17 new-ariel systemd[1]: openqa-webui.service: Deactivated successfully.
Jul 29 03:00:17 new-ariel systemd[1]: Stopped The openQA web UI.
Jul 29 03:00:18 new-ariel systemd[1]: Started The openQA web UI.
Jul 29 03:00:20 new-ariel openqa-webui-daemon[16934]: Web application available at http://127.0.0.1:9526
Jul 29 06:22:18 new-ariel systemd[1]: Stopping The openQA web UI...
Jul 29 06:22:18 new-ariel systemd[1]: openqa-webui.service: Deactivated successfully.
Jul 29 06:22:18 new-ariel systemd[1]: Stopped The openQA web UI.
Jul 29 06:22:22 new-ariel systemd[1]: Started The openQA web UI.
Jul 26 09:04:51 new-ariel systemd[1]: Stopped The openQA web UI.
-- Boot bbbfaf4f83884377b7872e905d15f1aa --
Jul 26 09:05:18 new-ariel systemd[1]: Started The openQA web UI.
Jul 26 09:08:54 new-ariel systemd[1]: Stopped The openQA web UI.
-- Boot f212eca01e65453a817d7ff2deea3e83 --
Jul 26 09:09:19 new-ariel systemd[1]: Started The openQA web UI.
Aug 06 03:30:01 new-ariel systemd[1]: Stopped The openQA web UI.
-- Boot 6bffce71fac047bf8ed7231e42e14c13 --
Aug 06 03:30:32 new-ariel systemd[1]: Started The openQA web UI.

But of course this can be longer than the 5 retry attempts we have right now. So we should probably increase the timeout or number of retries as well.

Actions #56

Updated by kraih 9 months ago

PR to make attempts configurable: https://github.com/os-autoinst/openQA/pull/5281

Actions #57

Updated by tinita 9 months ago

Starting at 11/Aug/2023:22:12:18, until 11/Aug/2023:22:12:39, we had over 700 instances of 502 on o3.

We were notified because of this failed job: https://openqa.opensuse.org/tests/3501150

During that time openQA was updated.

Aug 11 22:12:18 new-ariel systemd[1]: Stopping Handler for live view in openQA's web UI...
Aug 11 22:12:18 new-ariel systemd[1]: Stopping The openQA web UI...
Aug 11 22:12:18 new-ariel openqa-gru[25328]: timeout: sending signal TERM to command ‘env’
Aug 11 22:12:18 new-ariel openqa-livehandler-daemon[19998]: Web application available at http://127.0.0.1:9528
Aug 11 22:12:18 new-ariel systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Aug 11 22:12:18 new-ariel systemd[1]: openqa-enqueue-audit-event-cleanup.timer: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped Enqueues an audit event cleanup task for the openQA database every day..
Aug 11 22:12:18 new-ariel systemd[1]: Stopping Enqueues an audit event cleanup task for the openQA database every day....
Aug 11 22:12:18 new-ariel systemd[1]: Started Enqueues an audit event cleanup task for the openQA database every day..
Aug 11 22:12:18 new-ariel systemd[1]: openqa-enqueue-asset-cleanup.timer: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped Enqueues an asset cleanup task for openQA every hour..
Aug 11 22:12:18 new-ariel systemd[1]: Stopping Enqueues an asset cleanup task for openQA every hour....
Aug 11 22:12:18 new-ariel systemd[1]: Started Enqueues an asset cleanup task for openQA every hour..
Aug 11 22:12:18 new-ariel systemd[1]: openqa-enqueue-result-cleanup.timer: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped Enqueues a result/screenshots/logs cleanup task for openQA every day..
Aug 11 22:12:18 new-ariel systemd[1]: Stopping Enqueues a result/screenshots/logs cleanup task for openQA every day....
Aug 11 22:12:18 new-ariel systemd[1]: Started Enqueues a result/screenshots/logs cleanup task for openQA every day..
Aug 11 22:12:18 new-ariel systemd[1]: openqa-enqueue-bug-cleanup.timer: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped Enqueues a bug cleanup task for the openQA database every week..
Aug 11 22:12:18 new-ariel systemd[1]: Stopping Enqueues a bug cleanup task for the openQA database every week....
Aug 11 22:12:18 new-ariel systemd[1]: Started Enqueues a bug cleanup task for the openQA database every week..
Aug 11 22:12:18 new-ariel systemd[1]: openqa-livehandler.service: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped Handler for live view in openQA's web UI.
Aug 11 22:12:18 new-ariel systemd[1]: Started Handler for live view in openQA's web UI.
Aug 11 22:12:18 new-ariel systemd[1]: openqa-webui.service: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped The openQA web UI.
Aug 11 22:12:18 new-ariel systemd[1]: Stopping The openQA WebSockets server...
Aug 11 22:12:18 new-ariel openqa-websockets-daemon[20010]: Web application available at http://127.0.0.1:9527
Aug 11 22:12:18 new-ariel systemd[1]: openqa-websockets.service: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped The openQA WebSockets server.
Aug 11 22:12:18 new-ariel systemd[1]: Stopping The openQA Scheduler...
Aug 11 22:12:18 new-ariel openqa-scheduler-daemon[20009]: Web application available at http://127.0.0.1:9529
Aug 11 22:12:18 new-ariel systemd[1]: openqa-scheduler.service: Deactivated successfully.
Aug 11 22:12:18 new-ariel systemd[1]: Stopped The openQA Scheduler.
Aug 11 22:12:38 new-ariel systemd[1]: openqa-gru.service: Deactivated successfully.
Aug 11 22:12:38 new-ariel systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Aug 11 22:12:38 new-ariel systemd[1]: Starting Setup local PostgreSQL database for openQA...
Aug 11 22:12:38 new-ariel systemd[1]: openqa-setup-db.service: Deactivated successfully.
Aug 11 22:12:38 new-ariel systemd[1]: Finished Setup local PostgreSQL database for openQA.
Aug 11 22:12:38 new-ariel systemd[1]: Started The openQA Scheduler.
Aug 11 22:12:38 new-ariel systemd[1]: Started The openQA WebSockets server.
Aug 11 22:12:38 new-ariel systemd[1]: Started The openQA web UI.
Aug 11 22:12:38 new-ariel systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
Aug 11 22:12:38 new-ariel systemd[1]: Reloading.
Aug 11 22:12:39 new-ariel openqa-continuous-update[24737]: (4/7) Installing: openQA-local-db-4.6.1691790324.4fe4f04-lp154.6004.1.x86_64 [....done]
Aug 11 22:12:39 new-ariel openqa-webui-daemon[25425]: Web application available at http://127.0.0.1:9526
Actions #58

Updated by kraih 9 months ago

tinita wrote:

During that time openQA was updated.

Aug 11 22:12:18 new-ariel systemd[1]: Stopping The openQA web UI...
...
Aug 11 22:12:39 new-ariel openqa-webui-daemon[25425]: Web application available at http://127.0.0.1:9526

That was really unfortunate timing with the update, a few more seconds and it would have been fine:

[2023-08-12T00:12:18.468128+02:00] [error] Failed uploading asset
[2023-08-12T00:12:18.468341+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@-:TW.22162.qcow2: 502 response: Bad Gateway
[2023-08-12T00:12:18.469225+02:00] [error] Upload failed for chunk 2198
[2023-08-12T00:12:23.588445+02:00] [error] Failed uploading asset
[2023-08-12T00:12:23.588664+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@-:TW.22162.qcow2: 502 response: Bad Gateway
[2023-08-12T00:12:23.589536+02:00] [error] Upload failed for chunk 2198
[2023-08-12T00:12:28.771870+02:00] [error] Failed uploading asset
[2023-08-12T00:12:28.772124+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@-:TW.22162.qcow2: 502 response: Bad Gateway
[2023-08-12T00:12:28.772996+02:00] [error] Upload failed for chunk 2198
[2023-08-12T00:12:33.889737+02:00] [error] Failed uploading asset
[2023-08-12T00:12:33.889891+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@-:TW.22162.qcow2: 502 response: Bad Gateway
[2023-08-12T00:12:33.890357+02:00] [error] Upload failed for chunk 2198
[2023-08-12T00:12:38.992672+02:00] [error] Failed uploading asset
[2023-08-12T00:12:38.992895+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@-:TW.22162.qcow2: 502 response: Bad Gateway
[2023-08-12T00:12:38.993774+02:00] [error] Upload failed for chunk 2198
[2023-08-12T00:12:43.994440+02:00] [error] Upload failed, and all retry attempts have been exhausted

So we need at least one more attempt here, or a little more time in between attempts (possibly via dynamic backoff).

Actions #59

Updated by kraih 9 months ago

PR for upload delays with backoff: https://github.com/os-autoinst/openQA/pull/5286

Actions #60

Updated by kraih 9 months ago

  • Status changed from In Progress to Feedback

Now let's wait for all the changes to be deployed.

Actions #61

Updated by tinita 9 months ago

  • Due date changed from 2023-08-18 to 2023-08-25
Actions #62

Updated by kraih 9 months ago

Not much activity over the weekend according to the logs and it's a pretty slow Monday, so no new data yet. All patches are deployed to O3 workers now though.

Actions #63

Updated by tinita 9 months ago

  • Due date changed from 2023-08-25 to 2023-08-30
Actions #64

Updated by kraih 9 months ago

We had a few 502 responses during the service restart yesterday; I checked all affected tests and it looks like they all recovered fine. No real need for the backoff yet though.

Actions #65

Updated by kraih 8 months ago

  • Status changed from Feedback to Resolved

I think we can consider this resolved now.
