action #127097: [alert] Failed systemd services alert - openQA Infrastructure (public) - openSUSE Project Management Tool

action #127097

closed

[alert] Failed systemd services alert

Added by mkittler about 2 years ago. Updated about 2 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-04-03

Due date:

% Done:

Estimated time:

Tags:

alert, infra

Description

Affected hosts/services:

openqaw5-xen: stop-iface-ovs-system
baremetal-support: openqa-scheduler, openqa-webui
openqa-piworker: logrotate
worker11: var-lib-openqa-share.mount

These are all distinct issues. I'll possibly create separate tickets if they take too long to fix.

Related issues 2 (0 open — 2 closed)

Updated by mkittler about 2 years ago

About 1: This service is deployed via /etc/systemd/system/stop-iface-ovs-system.service which does not belong to a package and lacks a description. Restarting the service worked. However, it is not clear what the purpose of the service is. The error was:

martchus@openqaw5-xen:~> sudo journalctl -fu stop-iface-ovs-system.service
Apr 02 03:34:02 openqaw5-xen bash[2392]: + virsh iface-list --all
Apr 02 03:34:02 openqaw5-xen bash[2393]: + grep -w active
Apr 02 03:34:02 openqaw5-xen bash[2395]: + grep ovs-system
Apr 02 03:34:02 openqaw5-xen bash[2394]: + awk '{ print $1 }'
Apr 02 03:34:10 openqaw5-xen bash[2395]: ovs-system
Apr 02 03:34:10 openqaw5-xen bash[2388]: + '[' 0 -eq 0 ']'
Apr 02 03:34:10 openqaw5-xen bash[2388]: + virsh iface-destroy ovs-system
Apr 02 03:34:10 openqaw5-xen bash[3352]: Interface ovs-system destroyed
Apr 02 03:34:10 openqaw5-xen systemd[1]: stop-iface-ovs-system.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 03:34:10 openqaw5-xen systemd[1]: stop-iface-ovs-system.service: Failed with result 'exit-code'.

For now I have dropped a message about the problem on #eng-testing. This must be good enough, considering it is really hard to tell where the service/script is coming from, and I don't think reverse engineering this setup is worth the effort.

About 2: The web UI services worked again after restarting. The error was:

-- Boot 1eeaff687b724b6fb7be596956c3ee46 --
Apr 02 03:37:18 baremetal-support systemd[1]: Started The openQA web UI.
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: Error when trying to get the database version: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]:         Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/VersionStorage/Standard.pm at line 26) line 18
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]:         Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
Apr 02 03:39:39 baremetal-support systemd[1]: openqa-webui.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 02 03:39:39 baremetal-support systemd[1]: openqa-webui.service: Failed with result 'exit-code'.

So for some reason PostgreSQL had not been started yet when the web UI service was started. We have #125903 and #122578 but those are about DB shutdowns and thus not really related. openQA-local-db is installed on the host, the dependencies at the systemd level seem correct, and PostgreSQL is linked against libsystemd.so.0 and thus should signal its readiness correctly. So I'm out of ideas as to what went wrong there.
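A minimal sketch of how the "dependencies seem correct" check could be scripted. It parses the `KEY=value` text printed by `systemctl show --property=After --property=Requires openqa-webui.service` and reports which dependency kinds do not mention the database unit. The unit name `postgresql.service`, the choice of `After`/`Requires` as the relevant properties, and the helper names are assumptions for illustration, not taken from the host's actual configuration.

```python
from typing import Dict, List, Sequence


def parse_show(output: str) -> Dict[str, List[str]]:
    """Parse `systemctl show` output (one KEY=value pair per line).

    Values such as After= are space-separated unit lists, so each value
    is split on whitespace.
    """
    props: Dict[str, List[str]] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without a KEY=value shape
            props[key] = value.split()
    return props


def missing_dependencies(
    output: str,
    unit: str,
    kinds: Sequence[str] = ("After", "Requires"),
) -> List[str]:
    """Return the dependency kinds that do not list `unit`."""
    props = parse_show(output)
    return [kind for kind in kinds if unit not in props.get(kind, [])]
```

An empty result only confirms that ordering and requirement are declared; it does not explain why the socket was still missing when the web UI started.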

Updated by mkittler about 2 years ago

Tags set to alert, infra

Updated by mkittler about 2 years ago

About 3: The log of logrotate is empty once again. I suppose we best just ignore this recurring issue as it eventually fixes itself anyway.

About 4: This was a DNS problem:

Apr 02 03:36:02 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Apr 02 03:36:02 worker11 mount[9438]: mount.nfs: Failed to resolve server openqa.suse.de: Name or service not known
Apr 02 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited, status=32/n/a
Apr 02 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Failed with result 'exit-code'.
Apr 02 03:36:02 worker11 systemd[1]: Failed to mount /var/lib/openqa/share.

Considering it fixed itself, this is something we may also just want to ignore somehow. Supposedly it was not a problem with the DNS server; rather, the NFS mount was attempted before DNS had been set up on the host. (We already have noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m in /etc/fstab but the nofail option and retries are apparently not enough to prevent the unit from failing entirely.)
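To illustrate the suspected race, here is a minimal sketch of a check that waits until a host name resolves before anything depending on it is attempted. It is not what is deployed on worker11; the function name `wait_for_dns` and its parameters are hypothetical, and the resolver and sleep function are injectable only so the retry logic can be exercised without a network.

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    host: str,
    attempts: int = 30,
    delay: float = 1.0,
    resolve: Callable = socket.getaddrinfo,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `host` resolves, False after `attempts` failures."""
    for attempt in range(attempts):
        try:
            resolve(host, None)  # raises socket.gaierror if the name is unknown
            return True
        except socket.gaierror:
            # Only sleep when another attempt will follow.
            if attempt + 1 < attempts:
                sleep(delay)
    return False
```

A call such as `wait_for_dns("openqa.suse.de")` would have kept retrying during the window in which the log above shows "Name or service not known".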

Updated by okurz about 2 years ago

Priority changed from Normal to Urgent
Target version set to Ready

Updated by rfan1 about 2 years ago

Related to action #126647: [qe-core] test fails in bootloader_start - we should use br0 not ovs-system added

Updated by rfan1 about 2 years ago

For item 1: openqaw5-xen: stop-iface-ovs-system

I have fixed my code; more details can be found at:
https://progress.opensuse.org/issues/126647

Updated by openqa_review about 2 years ago

Due date set to 2023-04-18

Setting due date based on mean cycle time of SUSE QE Tools

Updated by mkittler about 2 years ago

About 1: I have also updated the description of the service so next time we know who is responsible for it. I think this worked out rather well in the end so I wouldn't have a problem catching this error again to forward it.

About 2 and 4: Likely not worth investigating at this point after just one occurrence.

About 3: I'm looking into how we can ignore it.

Updated by mkittler about 2 years ago

About 1: Unfortunately it has been failing again: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1680522497651&to=1680625869799

Updated by rfan1 about 2 years ago

mkittler wrote:

About 1: Unfortunately it has been failing again: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1680522497651&to=1680625869799

I restarted the service on the xen host today to test my code, using the original script which had the issue:
2023-04-04 03:36:00 systemctl restart stop-iface-ovs-system.service
I think the failed service was captured by the openQA monitor.

With the new script, starting or restarting the service returns exit code 0 whether or not the bridge ovs-system is present.
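The behaviour described here (exit code 0 whether or not the interface exists) can be sketched as follows. This is not the actual script from #126647, which is a bash script per the journal output; it only reuses the two `virsh` commands visible in that trace, and the function names and the injectable `runner` are assumptions made so the logic can be tested without libvirt.

```python
import subprocess
from typing import Callable, List


def run(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output without raising on failure."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def active_interfaces(listing: str) -> List[str]:
    """Return names of interfaces whose state column is exactly 'active'."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Header and separator rows never have 'active' in the second column.
        if len(fields) >= 2 and fields[1] == "active":
            names.append(fields[0])
    return names


def stop_interface(name: str, runner: Callable = run) -> int:
    """Destroy `name` if it is active; succeed without action otherwise."""
    listing = runner(["virsh", "iface-list", "--all"])
    if listing.returncode != 0:
        return listing.returncode
    if name not in active_interfaces(listing.stdout):
        return 0  # interface absent or inactive: nothing to do
    return runner(["virsh", "iface-destroy", name]).returncode
```

Used as a service entry point, the return value of `stop_interface("ovs-system")` would be passed to `sys.exit`, so the unit only fails when `virsh` itself reports an error.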
