action #161309

closed

osd not accessible, 502 Bad Gateway

Added by jbaier_cz 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2024-05-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

Users pointed out in https://suse.slack.com/archives/C02CANHLANP/p1717141066882819 that osd is down; the problem was confirmed. As nginx was complaining about its upstream, openqa-webui.service was restarted to remedy the situation. After that, the following error was seen in the logs:

May 31 09:39:52 openqa openqa[19648]: Error when trying to get the database version: DBIx::Class::Storage::DBI::_do_query(): DBI Exception: DBD::Pg::db do failed: ERROR:  syntax error at or near "PRAGMA"
May 31 09:39:52 openqa openqa[19648]: LINE 1: PRAGMA synchronous = OFF
May 31 09:39:52 openqa openqa[19648]:         ^ at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/VersionStorage/Standard.pm at line>

No service restart helped. As there was also a glibc update involved, we decided to ensure consistency with a full system reboot, as commented in #161309-1.

There was a deployment running (https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/1149031) just before the incident; the issue can already be seen in the telegraf post-deploy job:

openqa.suse.de:
    2024-05-31T07:33:42Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/jobs]: received status code 502 (Bad Gateway), expected any value out of [200]
    2024-05-31T07:33:42Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/minion]: received status code 502 (Bad Gateway), expected any value out of [200]
    2024-05-31T07:33:45Z E! [telegraf] Error running agent: input plugins recorded 2 errors
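
For reference, the same endpoints can be probed manually; a minimal sketch (the curl invocation is illustrative, the URLs are the ones telegraf polls above):

# Probe the two influxdb endpoints telegraf polls and print only the HTTP status code
for path in jobs minion; do
    curl -s -o /dev/null -w "$path: %{http_code}\n" "https://openqa.suse.de/admin/influxdb/$path"
done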

Rollback actions

* delete deploy freeze for today in bot-ng to re-enable pipeline scheduling DONE


Related issues 5 (1 open, 4 closed)

Related to openQA Infrastructure (public) - action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk (Resolved, nicksinger, 2024-03-01)
Copied to openQA Infrastructure (public) - action #161318: Ensure we have a consistent racktables entry for OSD (Resolved, okurz, 2024-05-31)
Copied to openQA Infrastructure (public) - action #161321: OSD status overview in monitoring shows osd is online despite the whole machine being down (New, 2024-05-31)
Copied to openQA Infrastructure (public) - action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" (Resolved, okurz, 2024-05-31)
Copied to openQA Infrastructure (public) - action #162332: 2024-06-15 osd not accessible size:M (Resolved, okurz)

Actions #1

Updated by okurz 7 months ago · Edited

Problem analysis

openqa:~ # grep 2024-05-31 /var/log/zypp/history
2024-05-31 02:36:50|command|root@openqa|'zypper' '-n' '--no-refresh' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--download-in-advance'|
2024-05-31 02:36:50|install|apache2-utils|2.4.51-150400.6.17.1|x86_64||repo-sle-update|63ca3ca805203549551b22388014034d33f9b9a31c42f40443f8e9425222327b|
2024-05-31 02:36:51|install|dwz|0.12-150000.3.8.1|x86_64||repo-sle-update|2275abd9d4af34208b5a4a2eaef22840a79f87f16c4d164d88046cad05e1876d|
2024-05-31 02:36:51|install|libpython2_7-1_0|2.7.18-150000.65.1|x86_64||repo-sle-update|1f814ddcfb2c47b966eacda314ef92291648c6d62df34b7cef3968d373251b02|
2024-05-31 02:36:52|install|python3-Jinja2|2.10.1-150000.3.13.1|noarch||repo-sle-update|affc3829e86cae5713ba714fdb438564e17f84a9943ca6a7281b76102e8ee87b|
# 2024-05-31 02:36:54 apache2-2.4.51-150400.6.17.1.x86_64.rpm installed ok
2024-05-31 02:36:54|install|apache2|2.4.51-150400.6.17.1|x86_64||repo-sle-update|e070e1cc1a78db72fb0a23b66f65bc4761f01f27973c986d3a5e48742b06ae32|
# 2024-05-31 02:36:54 apache2-prefork-2.4.51-150400.6.17.1.x86_64.rpm installed ok
2024-05-31 02:36:54|install|apache2-prefork|2.4.51-150400.6.17.1|x86_64||repo-sle-update|a42f4e5544a8db9502d8678ec7d7474b9242b5754a67e32eab7f5358c8320879|
2024-05-31 02:36:56|install|python-base|2.7.18-150000.65.1|x86_64||repo-sle-update|49e8a7867aa17e837194e20e82236879f8ebca28da780575101146ec663d4fdc|
2024-05-31 02:36:56|install|python-xml|2.7.18-150000.65.1|x86_64||repo-sle-update|d3c2705e706b6bf3ec949284e20981189c287a63bc35d3854b77b1810442d954|
2024-05-31 02:36:56|install|python-curses|2.7.18-150000.65.1|x86_64||repo-sle-update|c2052b61d802d46644ae4da746cfea295a021b4c1ce326a709a40fe2172fac9e|
2024-05-31 02:36:56|install|python|2.7.18-150000.65.1|x86_64||repo-sle-update|6a9533c9135fde0f35fa701328a29227aa29fcdfb2c46bf0b105fb14bb6c7309|
# 2024-05-31 02:36:57 %posttrans(apache2-2.4.51-150400.6.17.1.x86_64) script output:
2024-05-31 02:36:57|patch  |openSUSE-SLE-15.5-2024-1851|1|noarch|repo-sle-update|low|recommended|needed|applied|
2024-05-31 02:36:57|patch  |openSUSE-SLE-15.5-2024-1868|1|noarch|repo-sle-update|important|security|needed|applied|
2024-05-31 02:36:57|patch  |openSUSE-SLE-15.5-2024-1862|1|noarch|repo-sle-update|moderate|security|needed|applied|
2024-05-31 02:36:57|patch  |openSUSE-SLE-15.5-2024-1863|1|noarch|repo-sle-update|moderate|security|needed|applied|
2024-05-31 07:14:38|command|root@openqa|'zypper' '--no-refresh' '-n' 'dup' '--replacefiles'|
2024-05-31 07:14:39|install|openQA-common|4.6.1717018874.26363281-lp155.6714.1|x86_64||devel_openQA|efda0e86932f8f11dab1ad815aabe18744a3261c451124019ba59f188e6bb7ae|
2024-05-31 07:14:39|install|openQA-doc|4.6.1717018874.26363281-lp155.6714.1|x86_64||devel_openQA|8c2f7851c9a2b07b7f665d3fb139d8a7272cb78b74bb915ed89c3ccb73be1a53|
2024-05-31 07:14:39|install|openQA-client|4.6.1717018874.26363281-lp155.6714.1|x86_64||devel_openQA|b5d74775ed2b1586db628640c0943b81b353b6224b408105cd369082869b613a|
2024-05-31 07:14:54|install|openQA|4.6.1717018874.26363281-lp155.6714.1|x86_64||devel_openQA|3409fdf0a05f6567d1dc6896ef64d5a79dfd6803e4d72bbe1ea226a6b47b54ed|
2024-05-31 07:14:55|install|openQA-local-db|4.6.1717018874.26363281-lp155.6714.1|x86_64||devel_openQA|2ef47a2a8d113e6e3e0ef237a95aab184ec8ec3b4dffd2a5bdefd6865d9cc123|

does not list anything critical.

From https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1717140106926&to=1717141358515 we can see that the last valid data point for the openQA job queue is from 07:32Z, before problems were first mentioned in https://suse.slack.com/archives/C02CANHLANP/p1717141066882819

07:37Z (Wayne Chen) openqa.suse.de can not be reached. Error "502 Bad Gateway".

Created #161318 for the missing racktables entry.

We triggered a reboot and now the VM does not come up anymore; created
https://sd.suse.com/servicedesk/customer/portal/1/SD-158390

Actions #2

Updated by okurz 7 months ago

  • Status changed from New to In Progress
Actions #3

Updated by okurz 7 months ago

  • Copied to action #161318: Ensure we have a consistent racktables entry for OSD added
Actions #4

Updated by jbaier_cz 7 months ago

  • Description updated (diff)
Actions #5

Updated by okurz 7 months ago

  • Description updated (diff)
Actions #6

Updated by jbaier_cz 7 months ago

  • Copied to action #161321: OSD status overview in monitoring shows osd is online despite the whole machine being down added
Actions #7

Updated by jbaier_cz 7 months ago

The PRAGMA mentioned in the error message was added in 2013 by https://github.com/os-autoinst/openQA/commit/907c4c85de4b0e5f7b83ce58b37766d07a30cb5b

Actions #8

Updated by okurz 7 months ago · Edited

mcaj contacted me by DM:

(Martin Caj) One service is not working and stop the boot process. To fix it I need to know root password, that you can share with me, ok I can change it via mount/chroot the VM by force.
(Oliver Kurz) I guess the root password is known by you and not by us as we don't have any access besides ssh anyway. So maybe one you know already anyway for VMs? Or change it, up to you
(Martin Caj) no we do not keep that password in tou team pass
(Oliver Kurz) we never had that password as we don't have access to the hypervisor. Please change it and save it on your side. Or give us hypervisor access and then we will take care
(Martin Caj) VM is up, there is a problem with a disk that you have in fstab. for now I commented it out
(Oliver Kurz) please call systemctl stop openqa-{scheduler,webui} until the disk is available again to prevent data inconsistency (edited)
(Martin Caj) looks like a network is broken there. also file system is RO ... what happen there ?
(Oliver Kurz) a file system turning RO could be corruption due to multiple VMs accessing the same ressources. Could that be?
(Martin Caj) I will run fsck and check all and let you know, but it might take some time ...
(Oliver Kurz) fsck on the xfs filesystems? But I guess you could bring the system up and reachable over network without those additional partitions. They are not necessary to bring up the network. Depending on which disk has problems you can also consider to rollback to the latest backup snapshot on the netapp storage, e.g. for the root filesystem and possibly the "archive" one which you mentioned has problems but keep the others unchanged. Today less critical test results have been gathered so a data loss of some hours could be acceptable

Actions #9

Updated by okurz 7 months ago

  • Related to action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk added
Actions #10

Updated by okurz 7 months ago

  • Copied to action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" added
Actions #11

Updated by kraih 7 months ago

jbaier_cz wrote in #note-7:

The PRAGMA mentioned in the error message was added in 2013 by https://github.com/os-autoinst/openQA/commit/907c4c85de4b0e5f7b83ce58b37766d07a30cb5b

That PRAGMA is specific to SQLite, but on the webui side we have only used PostgreSQL for a long, long time.

Actions #12

Updated by kraih 7 months ago · Edited

kraih wrote in #note-11:

That PRAGMA is specific to SQLite, but on the webui side we have only used PostgreSQL for a long, long time.

Here are the two commits that removed the PRAGMA query back in 2017 together with SQLite support; it was actually in database.ini, not in the migrations themselves: https://github.com/os-autoinst/openQA/commit/98756cd18437abd6a70261f8dd36d6e0703597bb, https://github.com/os-autoinst/openQA/commit/5a802a4aac6b6d40a5c8608019c5c075b1efa3dd
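
A quick way to check whether a host still carries the old SQLite-only setting; a minimal sketch assuming the usual /etc/openqa/database.ini location (the path is confirmed further down in this ticket):

# Sketch: check whether the SQLite-only PRAGMA is still present in the production config
grep -n 'PRAGMA' /etc/openqa/database.ini || echo "no PRAGMA found"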

Actions #13

Updated by jbaier_cz 7 months ago

  • Description updated (diff)
Actions #14

Updated by jbaier_cz 7 months ago

The machine was briefly up, as we got new telegraf data into our monitoring. The failed systemd services panel suggests the recovery is not done yet.

Last update            Host      Failing units                                                                                                                      # failed services
2024-05-31 13:08:20    openqa    auditd, augenrules, kdump, logrotate-openqa, nginx, openqa-enqueue-asset-cleanup, postfix, postgresql, salt-master, smb, vsftpd    11
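
The same list can be confirmed directly on the host; a minimal sketch:

# List currently failed units directly on the host (should match the panel above)
systemctl --failed --no-legend
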
Actions #15

Updated by jbaier_cz 7 months ago

Additional info from the logs: the problem seems to have started with restarting the webui, which (likely) corresponds to the deployment with salt in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2672330

May 31 09:32:57 openqa systemd[1]: Started /usr/bin/systemctl restart openqa-webui.service.
May 31 09:32:57 openqa systemd[1]: Stopping Handler for live view in openQA's web UI...
May 31 09:32:57 openqa systemd[1]: Stopping The openQA web UI...
May 31 09:32:58 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
May 31 09:32:58 openqa systemd[1]: Stopped The openQA web UI.
May 31 09:32:58 openqa systemd[1]: Starting Setup local PostgreSQL database for openQA...
May 31 09:32:58 openqa systemd[1]: openqa-setup-db.service: Deactivated successfully.
May 31 09:32:58 openqa systemd[1]: Finished Setup local PostgreSQL database for openQA.
May 31 09:32:58 openqa systemd[1]: Started The openQA web UI.
May 31 09:32:58 openqa systemd[1]: run-r277104b4a9ee471db86de19a2b81bd07.scope: Deactivated successfully.
May 31 09:32:58 openqa systemd[1]: run-r0c2b860ba18f403287c33cf626610ba8.scope: Deactivated successfully.
May 31 09:32:58 openqa systemd[1]: Started /usr/bin/systemctl restart openqa-gru.service.
May 31 09:32:58 openqa systemd[1]: Stopping The openQA web UI...
May 31 09:32:58 openqa openqa-gru[8966]: timeout: sending signal TERM to command ‘env’
May 31 09:32:58 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
May 31 09:32:58 openqa openqa-gru[8950]: timeout: sending signal TERM to command ‘env’
May 31 09:32:58 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
May 31 09:32:58 openqa systemd[1]: Stopped The openQA web UI.
May 31 09:32:58 openqa systemd[1]: Started /usr/bin/systemctl restart openqa-gru.service.
May 31 09:33:00 openqa telegraf[20769]: 2024-05-31T07:33:00Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/minion]: received status code 502 (Bad Gateway), expected any value out of [200]
May 31 09:33:00 openqa telegraf[20769]: 2024-05-31T07:33:00Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/jobs]: received status code 502 (Bad Gateway), expected any value out of [200]
May 31 09:33:00 openqa systemd[1]: openqa-livehandler.service: Deactivated successfully.
May 31 09:33:00 openqa systemd[1]: Stopped Handler for live view in openQA's web UI.
May 31 09:33:00 openqa systemd[1]: Started Handler for live view in openQA's web UI.
May 31 09:33:03 openqa systemd[1]: openqa-gru.service: Deactivated successfully.
May 31 09:33:03 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
May 31 09:33:03 openqa systemd[1]: Starting Setup local PostgreSQL database for openQA...
May 31 09:33:03 openqa systemd[1]: openqa-setup-db.service: Deactivated successfully.
May 31 09:33:03 openqa systemd[1]: Finished Setup local PostgreSQL database for openQA.
May 31 09:33:03 openqa systemd[1]: Started The openQA web UI.
May 31 09:33:03 openqa systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
May 31 09:33:03 openqa systemd[1]: run-r2eee719abb5a417199da6e85357a9e1f.scope: Deactivated successfully.
May 31 09:33:03 openqa systemd[1]: run-rd4274c8a029746089b7014b5475441d6.scope: Deactivated successfully.
May 31 09:33:04 openqa openqa[10030]: Error when trying to get the database version: DBIx::Class::Storage::DBI::_do_query(): DBI Exception: DBD::Pg::db do failed: ERROR:  syntax error at or near "PRAGMA"
May 31 09:33:04 openqa openqa[10030]: LINE 1: PRAGMA synchronous = OFF
May 31 09:33:04 openqa openqa[10030]:         ^ at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/VersionStorage/Standard.pm at line 26) line 18
May 31 09:33:04 openqa openqa[10030]: DBIx::Class::Storage::DBI::_do_query(): DBI Exception: DBD::Pg::db do failed: ERROR:  syntax error at or near "PRAGMA"
May 31 09:33:04 openqa openqa[10030]: LINE 1: PRAGMA synchronous = OFF
May 31 09:33:04 openqa openqa[10030]:         ^ at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
May 31 09:33:04 openqa su[10167]: (to postgres) root on none
May 31 09:33:04 openqa systemd[1]: Created slice User Slice of UID 26.
May 31 09:33:04 openqa systemd[1]: Starting User Runtime Directory /run/user/26...
May 31 09:33:04 openqa systemd[1]: Finished User Runtime Directory /run/user/26.
May 31 09:33:04 openqa systemd[1]: Starting User Manager for UID 26...
May 31 09:33:04 openqa systemd[1]: openqa-webui.service: Main process exited, code=exited, status=255/EXCEPTION
May 31 09:33:04 openqa systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
May 31 09:33:04 openqa openqa-gru[10031]: Error when trying to get the database version: DBIx::Class::Storage::DBI::_do_query(): DBI Exception: DBD::Pg::db do failed: ERROR:  syntax error at or nea>
May 31 09:33:04 openqa openqa-gru[10031]: LINE 1: PRAGMA synchronous = OFF
May 31 09:33:04 openqa openqa-gru[10031]:         ^ at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib>
May 31 09:33:04 openqa (systemd)[10176]: pam_unix(systemd-user:session): session opened for user postgres by (uid=0)
May 31 09:33:04 openqa su[10173]: (to postgres) root on none
May 31 09:33:04 openqa openqa-gru[10031]: DBIx::Class::Storage::DBI::_do_query(): DBI Exception: DBD::Pg::db do failed: ERROR:  syntax error at or near "PRAGMA"
May 31 09:33:04 openqa openqa-gru[10031]: LINE 1: PRAGMA synchronous = OFF
May 31 09:33:04 openqa openqa-gru[10031]:         ^ at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Cl>
May 31 09:33:04 openqa systemd[1]: openqa-gru.service: Main process exited, code=exited, status=255/EXCEPTION
May 31 09:33:04 openqa systemd[1]: openqa-gru.service: Failed with result 'exit-code'.

All our actions started after that

May 31 09:38:53 openqa systemd-logind[825]: New session 1659 of user jbaier.
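
For reference, a journal window like the excerpt above can be pulled with something like the following; a sketch, with the service names and times taken from the log lines (local CEST), not the exact invocation used at the time:

# Sketch: extract the relevant journal window around the webui restart
journalctl --since "2024-05-31 09:32:00" --until "2024-05-31 09:40:00" \
    -u openqa-webui.service -u openqa-gru.service -u nginx.service
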
Actions #16

Updated by okurz 7 months ago · Edited

Further update from DMs with mcaj:

(Martin Caj) I still working on it. it looks like there ware just a small problem on root file system.
see this: fsck.ext4 -fv /dev/mapper/loop0p1 […]. disk is now clean but /etc/fstab looks uncompleted
(Oliver Kurz) what do you mean with "uncompleted"? If there was corruption on the root disk then maybe even better to consider rolling back just this disk to a latest backup snapshot on netapp. I could check but I can't login over ssh yet though ping responds.
13:59 (Martin Caj) some problems are also on vdd disk, running xfs_repair there, but machine at least boot now

I could log in over ssh now. cat /etc/openqa/database.ini looks interesting, similar to what we expected:

[production]
dsn = dbi:Pg:dbname=openqa
user = geekotest
on_connect_do = PRAGMA synchronous = OFF
sqlite_unicode = 1

Modification date: May 31 09:33 (so either just "touched" or actually modified), which corresponds to the salt deployment. I took a look at backup-vm.qe.nue2.suse.org and found that previous versions actually looked correct; the current one is really corrupted. Old:

[test]
dsn = dbi:SQLite:dbname=:memory:
on_connect_call = use_foreign_keys
on_connect_do = PRAGMA synchronous = OFF
sqlite_unicode = 1

[production]
dsn = dbi:Pg:dbname=openqa
user = geekotest

The last correct version was /home/rsnapshot/alpha.3/openqa.suse.de/etc/openqa/database.ini from May 31 07:32, so the May 31 09:33 one is actually the first corrupted one.
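
A sketch of how such a comparison can be done (assuming ssh access from osd to the backup host; paths as noted above):

# Compare the current database.ini with the last known-good copy from the rsnapshot backup
ssh backup-vm.qe.nue2.suse.org \
    cat /home/rsnapshot/alpha.3/openqa.suse.de/etc/openqa/database.ini \
    | diff -u - /etc/openqa/database.ini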

I have now changed the current file to

[production]
dsn = dbi:Pg:dbname=openqa
user = geekotest
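
Once the filesystems are back and the file is corrected, the affected services would need a restart to pick up the fix; a minimal sketch (the exact set of units restarted at the time is not recorded here):

# Restart the services that read database.ini on startup
systemctl restart openqa-webui.service openqa-gru.service openqa-livehandler.service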

14:42 (Oliver Kurz) I don't see any fsck running anymore. Is it completed?
14:45 (Martin Caj) I see there this: […] so its Out of memory problem ? we have there 28Gb
14:48 (Oliver Kurz) don't know, try again? By the way, can we please try a screen session by the root user or something for easier collaboration?
14:48 (Martin Caj) sure running tmux there with xfs_repair /dev/vdd in terminal 0
14:51 (Oliver Kurz) https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1717156532258&to=1717157918174 shows how xfs_repair used up the complete system memory. There is an option to run xfs_repair with a lower memory limit, e.g. xfs_repair -m $((10*1024)). I will not stop the currently running process but try to stop some other processes, e.g. velociraptor-client instead. systemctl stop velociraptor-client.service rsyncd.service salt-minion salt-master vsftpd.service smb.service

Actions #17

Updated by jbaier_cz 7 months ago

We do not have any logs from nginx from that timeframe. The stopping line corresponds to me issuing a restart of nginx.

May 27 13:30:06 openqa systemd[1]: Started The nginx HTTP and reverse proxy server.
May 31 09:39:51 openqa systemd[1]: Stopping The nginx HTTP and reverse proxy server...
May 31 09:39:56 openqa systemd[1]: nginx.service: State 'stop-sigterm' timed out. Killing.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6831 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6846 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6847 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6848 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6849 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Main process exited, code=killed, status=9/KILL
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6846 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6847 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6848 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Killing process 6849 (nginx) with signal SIGKILL.
May 31 09:39:56 openqa systemd[1]: nginx.service: Failed with result 'timeout'.
May 31 09:39:56 openqa systemd[1]: nginx.service: Unit process 6846 (nginx) remains running after unit stopped.
May 31 09:39:56 openqa systemd[1]: nginx.service: Unit process 6847 (nginx) remains running after unit stopped.
May 31 09:39:56 openqa systemd[1]: nginx.service: Unit process 6848 (nginx) remains running after unit stopped.
May 31 09:39:56 openqa systemd[1]: nginx.service: Unit process 6849 (nginx) remains running after unit stopped.
May 31 09:39:56 openqa systemd[1]: Stopped The nginx HTTP and reverse proxy server.
May 31 09:39:56 openqa systemd[1]: Starting The nginx HTTP and reverse proxy server...
May 31 09:39:56 openqa nginx[19713]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
May 31 09:39:56 openqa nginx[19713]: nginx: configuration file /etc/nginx/nginx.conf test is successful
May 31 09:39:56 openqa systemd[1]: Started The nginx HTTP and reverse proxy server.

Nothing interesting for postgres either:

May 25 02:37:18 openqa systemd[1]: Started PostgreSQL database server.
May 31 09:51:44 openqa systemd[1]: Stopping PostgreSQL database server...
May 31 09:51:46 openqa systemd[1]: postgresql.service: Deactivated successfully.
May 31 09:51:46 openqa systemd[1]: Stopped PostgreSQL database server.
May 31 09:51:46 openqa systemd[1]: Starting PostgreSQL database server...
May 31 09:51:46 openqa postgresql-script[29086]: 2024-05-31 09:51:46.235 CEST   [29086]LOG:  redirecting log output to logging collector process
May 31 09:51:46 openqa postgresql-script[29086]: 2024-05-31 09:51:46.235 CEST   [29086]HINT:  Future log output will appear in directory "log".
May 31 09:51:46 openqa systemd[1]: Started PostgreSQL database server.
May 31 10:08:36 openqa systemd[1]: Stopping PostgreSQL database server...
May 31 10:08:36 openqa systemd[1]: postgresql.service: Deactivated successfully.
May 31 10:08:36 openqa systemd[1]: Stopped PostgreSQL database server.

Zypper history does not show any suspicious updates.

2024-05-31 07:14:20 <1> openqa(30653) [zypper] main.cc(main):115 ===== 'zypper' '--no-refresh' '-n' 'dup' '--replacefiles' =====
...
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133     rpm-4.14.3-150400.59.3.1.x86_64 [150179]
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133     rpm-4.14.3-150400.59.7.1.x86_64 [150184]
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133     rpm-4.14.3-150400.59.16.1.x86_64 [162816]I
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133 job: distupgrade all packages
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133 job: install openSUSE-release-15.5-lp155.286.1.x86_64
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133   - job Rule #62208:
2024-05-31 07:14:24 <1> openqa(30653) [libsolv] PoolImpl.cc(logSat):133     openSUSE-release-15.5-lp155.286.1.x86_64 [162242]I (w1)
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 choice rule creation took 11 ms
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 58437 pkg rules, 2 * 1884 update rules, 3 job rules, 7 infarch rules, 5 dup rules, 18 choice rules, 0 best rules, 0 yumobs rules
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 96 black rules, 0 recommends rules, 0 repo priority rules
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 overall rule memory used: 1460 K
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 solver statistics: 0 learned rules, 0 unsolvable, 0 minimization steps
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 done solving.
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 solver took 3 ms
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 final solver statistics: 0 problems, 0 learned rules, 0 unsolvable
2024-05-31 07:14:24 <1> openqa(30653) [libsolv++] PoolImpl.cc(logSat):131 solver_solve took 29 ms
2024-05-31 07:14:24 <1> openqa(30653) [zypp::solver] SATResolver.cc(solving):767 ....Solver end
2024-05-31 07:14:24 <1> openqa(30653) [zypp::solver] SATResolver.cc(resolvePool):937 SATResolver::resolvePool() done. Ret:1
2024-05-31 07:14:24 <1> openqa(30653) [zypper] solve-commit.cc(solve_and_commit):752 got solution, showing summary
2024-05-31 07:14:24 <1> openqa(30653) [zypper] Summary.cc(readPool):239 Pool contains 69041 items.
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):240 Install summary:
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <install>   U_Ts_(12)openQA-4.6.1717018874.26363281-lp155.6714.1.x86_64(devel_openQA)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <install>   U_Ts_(24)openQA-client-4.6.1717018874.26363281-lp155.6714.1.x86_64(devel_openQA)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <install>   U_Ts_(29)openQA-common-4.6.1717018874.26363281-lp155.6714.1.x86_64(devel_openQA)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <install>   U_Ts_(42)openQA-doc-4.6.1717018874.26363281-lp155.6714.1.x86_64(devel_openQA)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <install>   U_Ts_r(46)openQA-local-db-4.6.1717018874.26363281-lp155.6714.1.x86_64(devel_openQA)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <uninstall> I_TsU(162236)openQA-4.6.1717018874.26363281-lp155.6711.1.x86_64(@System)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <uninstall> I_TsU(162237)openQA-client-4.6.1717018874.26363281-lp155.6711.1.x86_64(@System)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <uninstall> I_TsU(162238)openQA-common-4.6.1717018874.26363281-lp155.6711.1.x86_64(@System)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <uninstall> I_TsU(162239)openQA-doc-4.6.1717018874.26363281-lp155.6711.1.x86_64(@System)
2024-05-31 07:14:24 <1> openqa(30653) [zypper++] Summary.cc(readPool):280 <uninstall> I_TsU(162240)openQA-local-db-4.6.1717018874.26363281-lp155.6711.1.x86_64(@System)
2024-05-31 07:14:25 <1> openqa(30653) [zypper] Summary.cc(readPool):465 package update candidates: 5
2024-05-31 07:14:25 <1> openqa(30653) [zypper] Summary.cc(readPool):466 to be actually updated: 5
2024-05-31 07:14:25 <1> openqa(30653) [zypper] Summary.cc(readPool):465 product update candidates: 0
2024-05-31 07:14:25 <1> openqa(30653) [zypper] Summary.cc(readPool):466 to be actually updated: 0
...
2024-05-31 07:32:13 <1> openqa(14658) [zypper] main.cc(main):115 ===== 'zypper' '--non-interactive' '--xmlout' '--no-refresh' 'search' '--match-exact' 'python3-augeas' =====
2024-05-31 07:32:23 <1> openqa(14820) [zypper] main.cc(main):115 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'velociraptor-client' =====
...
Actions #18

Updated by jbaier_cz 7 months ago

There is a minor error in salt-minion around that time as well, but nothing specific. It could be unrelated.

May 31 09:31:05 openqa salt-minion[1689]: /usr/lib/python3.6/site-packages/salt/states/x509.py:214: DeprecationWarning: The x509 modules are deprecated. Please migrate to the replacement modules (x>
May 31 09:31:05 openqa salt-minion[1689]:   "The x509 modules are deprecated. Please migrate to the replacement "
May 31 09:31:05 openqa salt-minion[1689]: /usr/lib/python3.6/site-packages/salt/states/x509.py:214: DeprecationWarning: The x509 modules are deprecated. Please migrate to the replacement modules (x>
May 31 09:31:05 openqa salt-minion[1689]:   "The x509 modules are deprecated. Please migrate to the replacement "
May 31 09:31:05 openqa salt-minion[1689]: [WARNING ] test.ping args: []
May 31 09:31:05 openqa salt-minion[1689]: [WARNING ] test.ping args: []
May 31 09:32:40 openqa salt-minion[1689]: [ERROR   ] Unexpected exception in git.latest state
May 31 09:32:40 openqa salt-minion[1689]: Traceback (most recent call last):
May 31 09:32:40 openqa salt-minion[1689]:   File "/usr/lib/python3.6/site-packages/salt/states/git.py", line 1357, in latest
May 31 09:32:40 openqa salt-minion[1689]:     saltenv=__env__,
May 31 09:32:40 openqa salt-minion[1689]: NameError: name '__env__' is not defined
May 31 09:32:40 openqa salt-minion[1689]: [ERROR   ] name '__env__' is not defined
Actions #19

Updated by okurz 7 months ago · Edited

xfs_repair was killed again due to OOM. Restarting with xfs_repair -m $((10*1024)) /dev/vdd and also ran choom -p $(pidof xfs_repair) -n -50; see the sketch below.
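
For reference, a sketch of that retry (the memory cap is given in MB; the negative choom adjustment lowers the OOM score so the repair is less likely to be killed again):

# Re-run the repair with a ~10 GiB memory cap and make it a less preferred OOM victim
xfs_repair -m $((10*1024)) /dev/vdd &
choom -p $! -n -50
wait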

15:05 (Oliver Kurz) Another question: What's your current hypothesis why the machine could not come up after reboot on its own?
and /etc/fstab looks quite different now. Did you change it this much or do we have a much older snapshot of the filesystem in use?
15:10 (Martin Caj) When I look on the machine there was just this /etc/fstab.broken as fstab. what was the last action what did you do on the server ?
15:12 (Oliver Kurz) the last thing done was that salt was triggered. This seems to have corrupted /etc/openqa/database.ini and it also controls /etc/fstab. Maybe that caused a similar problem here. The correct fstab is in /srv/salt/fstab

Communication is a bit sluggish. Currently xfs_repair is still running. I asked about mitigations in case we need actions on the hypervisor level but have not received a response yet. I will continue the work for the next hours as long as I have access. In the best case the integrity checks finish within the next hours and we can mount all filesystems and restore services again. In a worse case we will need to repeat our plea for restoring from older storage snapshots to ensure consistency, or recover with access to the hypervisor. In that case we would only be able to recover services over the course of the next week.

I would await the completion of xfs_repair, copy over the correct fstab, then try to mount everything and bring up services again. Only in case we lose access again, or if we need to resort to rolling back from netapp storage, would I ask mcaj again.

xfs_repair finished. I mounted all partitions and restarted postgresql and openqa-webui first, but stopped openqa-scheduler again.

Called failed_since="2024-05-31 07:00Z" host=openqa.suse.de comment="label:poo161309" openqa-advanced-retrigger-jobs which, funnily, retriggered exactly two jobs. There had been many incompletes, but due to our already good handling they were all automatically restarted already.

We are in the process of cleaning up. openqa.suse.de should be back in full operation. Apparently salt, triggered by a deploy job from one of the merge requests this morning, caused a corruption of configuration files which prevented the openQA service from starting and also removed /etc/fstab, so that the VM got stuck on bootup. With the help of mcaj from IT the VM was brought back into an accessible state. Most of the time was spent on checking filesystems where problems had been encountered. Some hours could have been saved if we had had access to the hypervisor host or better collaboration, e.g. a shared screen and video conference. Without that, the communication was a bit sluggish and error-prone, but acceptable.

Actions #20

Updated by jbaier_cz 7 months ago

  • Description updated (diff)
Actions #21

Updated by jbaier_cz 7 months ago

  • Description updated (diff)
Actions #22

Updated by jbaier_cz 7 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Immediate to High

osd should be running right now and the rollback actions are done; we will keep observing to see if there is any related, not yet discovered problem.

Some follow-ups are already planned; we might add additional tasks after conducting #161324.

Actions #23

Updated by okurz 7 months ago

On https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1 I found that at least one worker instance failed on worker35 due to a stale file handle referring to the NFS mount point. I called salt --no-color -C 'G@roles:worker' cmd.run 'umount --lazy /var/lib/openqa/share && sleep 1 && mount /var/lib/openqa/share; mount -a'

Actions #24

Updated by jbaier_cz 7 months ago

  • Status changed from Feedback to Resolved

The Five Whys session was conducted and osd reported no related errors during the weekend; I believe we can close this for now as we already have follow-ups.

Actions #25

Updated by okurz 6 months ago

  • Copied to action #162332: 2024-06-15 osd not accessible size:M added