action #132167

closed

asset uploading failed with http status 502 size:M

Added by tinita 11 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-06-29
Due date:
2023-08-30
% Done:
0%

Estimated time:

Description

Observation

This job http://openqa.opensuse.org/tests/3393876 failed in the uploading phase:

[2023-06-29T18:11:42.386959+02:00] [error] Failed uploading asset
[2023-06-29T18:11:42.387207+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@64bit-2G-:TW.21297.qcow2: 502 response: Bad Gateway

At that time the webUI was apparently stopped and restarted:

Jun 29 16:11:42 ariel systemd[1]: Stopping The openQA web UI...
Jun 29 16:11:42 ariel systemd[1]: openqa-webui.service: Deactivated successfully.
Jun 29 16:11:42 ariel systemd[1]: Stopped The openQA web UI.
Jun 29 16:13:12 ariel systemd[1]: Started The openQA web UI.

I would expect nginx to return a 503 in that case, and the asset uploading should retry.

From /var/log/zypp/history we can see that the service was restarted during a package upgrade:

2023-06-29 16:11:33|command|root@ariel|'zypper' '-n' '--no-refresh' 'dup' '--auto-agree-with-licenses' '--download-in-advance'|
2023-06-29 16:11:34|install|openQA-common|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|1791b2e78db52c05fd9adc1ace2ac7b4451ec328d1558141796d90b8ec812c52|
2023-06-29 16:11:34|install|openQA-munin|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|b35e7151cd7b3b7f4224ae5244763267d29a20a48b5ea64a85ea397b13219bd6|
2023-06-29 16:11:36|install|openQA-continuous-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|910b7875523e37ae0db7c76cab74fe9d4113fcd2c07fe4a4578e2fd7a67c2d21|
2023-06-29 16:11:37|install|openQA-client|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|490b772224e5a64f354db398199ab23789fa7307d5edd8c561381a7dfe107e60|
2023-06-29 16:11:38|install|openQA-auto-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|6699637a855eccc6d5420c0c1129623da9c33096b25130ec9914963c9a526065|
2023-06-29 16:13:13|install|openQA|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|d137cfea77fc8567cfd7f9d5a5ccb372a54670da7e33ce5a3de293d53bb3c7f4|
2023-06-29 16:13:14|install|openQA-local-db|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|352062e0def659d7b349084300465b0161e5083b5a167a41ca49b312eb6cca63|

But 1.5 minutes is quite long. What caused that?

From /var/log/openqa.1.xz:

[2023-06-29T16:11:42.554141Z] [info] [pid:890] Worker 661 websocket connection closed - 1006
[2023-06-29T16:11:42.554307Z] [info] [pid:890] Worker 642 websocket connection closed - 1006
[2023-06-29T16:11:42.554567Z] [info] [pid:890] Worker 505 websocket connection closed - 1006
[2023-06-29T16:11:45.099867Z] [info] Listening at "http://127.0.0.1:9528" <-- liveview handler
[2023-06-29T16:11:45.100547Z] [info] Listening at "http://[::1]:9528"
(90s delay)
[2023-06-29T16:13:14.544271Z] [info] Listening at "http://127.0.0.1:9527" <-- websocket
[2023-06-29T16:13:14.544929Z] [info] Listening at "http://[::1]:9527"
[2023-06-29T16:13:14.918142Z] [info] Loading external plugin AMQP
[2023-06-29T16:13:14.973300Z] [info] Loading external plugin ObsRsync
[2023-06-29T16:13:15.527287Z] [info] Listening at "http://127.0.0.1:9526" <-- webUI
[2023-06-29T16:13:15.527842Z] [info] Listening at "http://[::1]:9526"
[2023-06-29T16:13:15.527939Z] [info] Manager 8148 started
[2023-06-29T16:13:15.537953Z] [info] Worker 8301 started

From /var/log/openqa_scheduler:

[2023-06-29T16:11:42.131776Z] [debug] [pid:889] Scheduler took 0.77747s to perform operations and allocated 0 jobs
[2023-06-29T16:11:42.131955Z] [debug] [pid:889] [rescheduling] Current tick is at 0 ms. New tick will be in: 20000 ms
(90s delay)
[2023-06-29T16:13:14.675640Z] [info] Listening at "http://127.0.0.1:9529"
[2023-06-29T16:13:14.676281Z] [info] Listening at "http://[::1]:9529"

Where does the 90s delay come from?

Maybe we need to bump the timeouts within nginx to be a bit longer than what we had in Apache?
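For reference, a minimal sketch of where such timeouts would live in the nginx proxy config, assuming a plain proxy_pass to the webUI port 9526 seen in the log above; all values are illustrative guesses, not the actual o3 configuration. Note that bumping timeouts alone would not cover this case: while the service is stopped, the backend refuses the connection immediately, so nginx answers 502 right away instead of waiting.

# hypothetical nginx proxy settings for the openQA webUI (illustrative values)
location / {
    proxy_pass http://127.0.0.1:9526;
    proxy_connect_timeout 10s;   # waiting for the backend to accept the connection
    proxy_send_timeout    600s;  # idle timeout while streaming large asset uploads
    proxy_read_timeout    600s;  # idle timeout while waiting for the backend response
    client_max_body_size  0;     # do not cap the size of uploaded qcow2 assets
}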

Acceptance criteria

  • AC1: Automatic package upgrades of openQA cause no confusing error reports to users
  • AC2: Automatic package upgrades of openQA do not cause incomplete openQA jobs

Suggestions

  • Apache returns 503 if the openQA webUI is down; why does nginx return 502? (see the config sketch after this list)
  • https://www.section.io/blog/504-503-errors-difference/
  • Ensure that openQA jobs do not become incomplete in the upload phase when the webUI is absent for a few seconds
  • Investigate why the webUI took 1.5 minutes to restart
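A minimal sketch of how the 502 from a stopped backend could be remapped to a 503 with a Retry-After hint, assuming the same proxy setup as sketched above; the named location and message are made up for illustration:

location / {
    proxy_pass http://127.0.0.1:9526;
    proxy_intercept_errors on;    # also catch 5xx pages sent by the backend itself
    error_page 502 @webui_down;   # a refused connection to the stopped webUI yields 502
}

location @webui_down {
    add_header Retry-After 10 always;                          # hint that the outage is short
    return 503 "openQA webUI is restarting, please retry\n";   # answer 503 instead of 502
}

Clients, including the worker's asset upload code, would then see a 503 and could retry instead of failing the job.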

Related issues 6 (0 open, 6 closed)

Related to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
Related to openQA Project - action #133496: [openqa-in-openQA] test fails due to not loaded install test modules (Resolved, okurz, 2023-07-28)
Related to openQA Infrastructure - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M (Resolved, dheidler, 2023-06-29)
Copied to openQA Project - action #133673: nginx on ariel logs in multiple locations size:M (Resolved, okurz)
Copied to openQA Project - action #133769: hydra.opensuse.org causing excessive load on o3 size:S (Resolved, livdywan, 2023-06-29 to 2023-08-04)
Copied to openQA Project - action #133772: Munin making many requests on o3 size:M (Resolved, tinita)
