action #175686

closed

coordination #161414: [epic] Improved salt based infrastructure management

OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui

Added by okurz 19 days ago. Updated 19 days ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui with systemctl restart openqa-webui by okurz.


Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure (public) - action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S (Resolved, nicksinger, 2025-01-16)

Copied to openQA Infrastructure (public) - action #175689: monitor.qe.nue2.suse.org "502 Bad Gateway" from nginx on 2025-01-17, missing grafana server files? (Resolved, nicksinger, 2025-01-17)

Copied to openQA Infrastructure (public) - action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S (Blocked, dheidler, 2025-01-17)

Actions #1

Updated by okurz 19 days ago · Edited

  • Tags set to infra, osd, reactive work
  • Status changed from New to In Progress

journalctl -u openqa-webui reveals that the openqa-webui service was first triggered to be reloaded but then ended up failing:

Jan 17 08:52:19 openqa systemd[1]: Reloading The openQA web UI...
Jan 17 08:52:19 openqa systemd[1]: Reloading The openQA web UI...
Jan 17 08:52:19 openqa systemd[1]: Reloaded The openQA web UI.
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]: [debug] [9fMblsYnqOwn] looking for "autoinst-log.txt" in [
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]:   "/var/lib/openqa/testresults/16466/16466106-sle-15-SP6-Online-QR-SAP-x86_64-sles4sap_nw_node02:investigate:retry\@64bit-sap-qam",
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]:   "/var/lib/openqa/testresults/16466/16466106-sle-15-SP6-Online-QR-SAP-x86_64-sles4sap_nw_node02:investigate:retry\@64bit-sap-qam/ulogs",
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]: ]
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]: [debug] [9fMblsYnqOwn] found bless({
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]:   path => "/var/lib/openqa/testresults/16466/16466106-sle-15-SP6-Online-QR-SAP-x86_64-sles4sap_nw_node02:investigate:retry\@64bit-sap-qam/autoinst-log.txt",
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]:   pid  => 1633,
Jan 17 08:52:23 openqa openqa-webui-daemon[1633]: }, "Mojo::Asset::File")
Jan 17 08:52:25 openqa openqa-webui-daemon[1253]: [info] Worker 5299 stopped
Jan 17 08:52:25 openqa openqa-webui-daemon[16648]: [info] Worker 16648 started
Jan 17 08:52:29 openqa openqa-webui-daemon[16473]: [info] Listening at "http://127.0.0.1:9526?reuse=1"
Jan 17 08:52:29 openqa openqa-webui-daemon[16473]: Web application available at http://127.0.0.1:9526
Jan 17 08:52:29 openqa openqa-webui-daemon[16473]: [info] Listening at "http://[::1]:9526?reuse=1"
Jan 17 08:52:29 openqa openqa-webui-daemon[16473]: Web application available at http://[::1]:9526
Jan 17 08:52:29 openqa openqa-webui-daemon[16473]: [info] Manager 16473 started
Jan 17 08:52:29 openqa openqa-webui-daemon[16848]: [info] Worker 16848 started
Jan 17 08:52:29 openqa openqa-webui-daemon[16849]: [info] Worker 16849 started
…
Jan 17 08:52:31 openqa openqa-webui-daemon[1253]: [info] Worker 7314 stopped
Jan 17 08:52:31 openqa openqa-webui-daemon[1253]: [info] Manager 1253 stopped
Jan 17 08:52:32 openqa systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
Jan 17 08:52:32 openqa systemd[1]: openqa-webui.service: Consumed 1month 1w 2d 13h 55min 16.642s CPU time.
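
To see the reload and the failure without the daemon's debug output in between, the journal can be narrowed to the lines written by systemd itself. This is only a minimal sketch: the helper name filter_unit_events is made up for illustration, and the time window is simply read off the excerpt above.

```shell
# Hypothetical helper: keep only lines emitted by systemd (PID 1),
# dropping the openqa-webui-daemon debug and info output.
filter_unit_events() {
    grep -E ' systemd\[1\]: '
}

# Usage on the affected host (time window assumed from the excerpt above):
# journalctl -u openqa-webui --since "2025-01-17 08:52" --until "2025-01-17 08:53" | filter_unit_events
```

Applied to the excerpt above, this would leave the "Reloading", "Reloaded", "Failed with result 'exit-code'" and "Consumed ... CPU time" lines.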

Given the close timing, I assume this was related to my merging https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1339, which triggered the application of salt states in https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/1517746. That pipeline failed in the "deploy" job https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3676441

Actions #2

Updated by okurz 19 days ago

  • Related to action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S added
Actions #3

Updated by okurz 19 days ago

  • Parent task set to #161414
Actions #4

Updated by okurz 19 days ago

  • Copied to action #175689: monitor.qe.nue2.suse.org "502 Bad Gateway" from nginx on 2025-01-17, missing grafana server files? added
Actions #5

Updated by okurz 19 days ago · Edited

jlausuch reported that the "OBS sync" menu item in the OSD web menu has vanished. I found that /etc/openqa/openqa.ini was corrupted again. Fixed by copying it back from the backup:

ssh root@backup-vm.qe.nue2.suse.org "cat /home/rsnapshot/delta.0/openqa.suse.de/etc/openqa/openqa.ini" | ssh osd "cat - | sudo tee /etc/openqa/openqa.ini"
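
A slightly more defensive variant of the restore above could keep the corrupted file for later analysis and refuse to install an empty copy. The following is a sketch under assumptions, not what was actually run: the function name restore_file, the .corrupt suffix and the temporary file are all hypothetical, and it is meant to run as root on OSD after the backup copy has been fetched to a local file.

```shell
# Hypothetical helper: restore_file NEW TARGET
# Keeps the current TARGET as TARGET.corrupt, then overwrites TARGET with NEW.
restore_file() {
    new=$1
    target=$2
    # Refuse to install a missing or empty file, e.g. after a failed fetch
    [ -s "$new" ] || { echo "refusing to install empty file: $new" >&2; return 1; }
    # Preserve the corrupted version so it can be inspected afterwards
    if [ -e "$target" ]; then
        cp -p "$target" "$target.corrupt" || return 1
    fi
    # Writing into the existing file keeps its owner and mode, like tee does
    cat "$new" > "$target"
}

# Usage (paths taken from the command above, temporary file is an assumption):
# tmp=$(mktemp)
# ssh root@backup-vm.qe.nue2.suse.org "cat /home/rsnapshot/delta.0/openqa.suse.de/etc/openqa/openqa.ini" > "$tmp"
# diff -u /etc/openqa/openqa.ini "$tmp"   # review what changed first
# restore_file "$tmp" /etc/openqa/openqa.ini
```

Keeping the corrupted copy would make it easier to compare against the salt-managed state when investigating how the file got damaged.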
Actions #6

Updated by okurz 19 days ago

  • Copied to action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S added
Actions #7

Updated by jbaier_cz 19 days ago

I guess the "HTTP endpoint does not properly work" alert is just late to the party and no longer an actual problem (the mentioned URL returns OK right now).

okurz wrote in #note-5:

jlausuch reported that the "OBS sync" menu item in the OSD web menu has vanished. I found that /etc/openqa/openqa.ini was corrupted again. Fixed by copying it back from the backup:

ssh root@backup-vm.qe.nue2.suse.org "cat /home/rsnapshot/delta.0/openqa.suse.de/etc/openqa/openqa.ini" | ssh osd "cat - | sudo tee /etc/openqa/openqa.ini"

Salt misbehaving again? The latest deploy pipeline shows some stuck minion jobs, so we might have inconsistencies in multiple places?

Actions #8

Updated by okurz 19 days ago

  • Status changed from In Progress to Resolved

jbaier_cz wrote in #note-7:

I guess the "HTTP endpoint does not properly work" alert is just late to the party and no longer an actual problem (the mentioned URL returns OK right now).

okurz wrote in #note-5:

jlausuch reported that the "OBS sync" menu item in the OSD web menu has vanished. I found that /etc/openqa/openqa.ini was corrupted again. Fixed by copying it back from the backup:

ssh root@backup-vm.qe.nue2.suse.org "cat /home/rsnapshot/delta.0/openqa.suse.de/etc/openqa/openqa.ini" | ssh osd "cat - | sudo tee /etc/openqa/openqa.ini"

Salt misbehaving again?

To be handled in #175710

The latest deploy pipeline shows some stuck minion jobs, so we might have inconsistencies in multiple places?

yes, handled in #175629
