action #127097
closed: [alert] Failed systemd services alert
Added by mkittler over 1 year ago. Updated over 1 year ago.
0%
Description
Affected hosts/services:
1. openqaw5-xen: stop-iface-ovs-system
2. baremetal-support: openqa-scheduler, openqa-webui
3. openqa-piworker: logrotate
4. worker11: var-lib-openqa-share.mount
These are all distinct issues. I'll possibly create separate tickets if they take too long to fix.
Updated by mkittler over 1 year ago
About 1: This service is deployed via /etc/systemd/system/stop-iface-ovs-system.service, which does not belong to a package and lacks a description. Restarting the service worked. However, it is not clear what the purpose of the service is. The error was:
martchus@openqaw5-xen:~> sudo journalctl -fu stop-iface-ovs-system.service
Apr 02 03:34:02 openqaw5-xen bash[2392]: + virsh iface-list --all
Apr 02 03:34:02 openqaw5-xen bash[2393]: + grep -w active
Apr 02 03:34:02 openqaw5-xen bash[2395]: + grep ovs-system
Apr 02 03:34:02 openqaw5-xen bash[2394]: + awk '{ print $1 }'
Apr 02 03:34:10 openqaw5-xen bash[2395]: ovs-system
Apr 02 03:34:10 openqaw5-xen bash[2388]: + '[' 0 -eq 0 ']'
Apr 02 03:34:10 openqaw5-xen bash[2388]: + virsh iface-destroy ovs-system
Apr 02 03:34:10 openqaw5-xen bash[3352]: Interface ovs-system destroyed
Apr 02 03:34:10 openqaw5-xen systemd[1]: stop-iface-ovs-system.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 03:34:10 openqaw5-xen systemd[1]: stop-iface-ovs-system.service: Failed with result 'exit-code'.
For now I have dropped a message about the problem on #eng-testing. This must be good enough, considering it is really hard to tell where the service/script comes from, and I don't think reverse engineering this setup is worth the effort.
About 2: The web UI services worked again after restarting. The error was:
-- Boot 1eeaff687b724b6fb7be596956c3ee46 --
Apr 02 03:37:18 baremetal-support systemd[1]: Started The openQA web UI.
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: Error when trying to get the database version: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/VersionStorage/Standard.pm at line 26) line 18
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
Apr 02 03:39:39 baremetal-support systemd[1]: openqa-webui.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 02 03:39:39 baremetal-support systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
So for some reason PostgreSQL had not been started by the time the web UI was started. We have #125903 and #122578, but those are about DB shutdowns and thus not really related. The openQA-local-db package is installed on the host, the dependencies on systemd level seem correct, and PostgreSQL is linked against libsystemd.so.0 and thus should signal its readiness correctly. So I'm out of ideas as to what went wrong there.
Updated by mkittler over 1 year ago
About 3: The log of logrotate is empty once again. I suppose we best just ignore this recurring issue, as it eventually fixes itself anyway.
About 4: This was a DNS problem:
Apr 02 03:36:02 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Apr 02 03:36:02 worker11 mount[9438]: mount.nfs: Failed to resolve server openqa.suse.de: Name or service not known
Apr 02 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited, status=32/n/a
Apr 02 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Failed with result 'exit-code'.
Apr 02 03:36:02 worker11 systemd[1]: Failed to mount /var/lib/openqa/share.
Considering it fixed itself, this is something we may also just want to ignore somehow. Supposedly it was not a problem with the DNS server; rather, the NFS mount was attempted before DNS had been set up on the host. (We already have noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m in /etc/fstab, but the nofail option and retries are apparently not enough to prevent the unit from failing entirely.)
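To illustrate the ordering problem described above (the mount is attempted before name resolution works), here is a minimal sketch of a bounded wait for DNS. It is not what is deployed on worker11; the function name `wait_for_dns` and its parameters are hypothetical, and it only assumes that `getent hosts` is available on the host.

```shell
#!/bin/bash
# Hypothetical helper, not part of the actual worker setup.
# Wait until a host name resolves, giving up after a bounded number of attempts.
#   $1: host name to resolve
#   $2: maximum number of attempts (default: 30)
#   $3: delay between attempts in seconds (default: 1)
wait_for_dns() {
    local host="$1" attempts="${2:-30}" delay="${3:-1}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # getent exits non-zero if the name cannot be resolved
        if getent hosts "$host" >/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Such a helper could be called as `wait_for_dns openqa.suse.de 60 5` before starting var-lib-openqa-share.mount again. Whether that is preferable to simply ignoring the one-off failure is a judgment call.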
Updated by okurz over 1 year ago
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by rfan1 over 1 year ago
- Related to action #126647: [qe-core] test fails in bootloader_start - we should use br0 not ovs-system added
Updated by rfan1 over 1 year ago
For item 1: openqaw5-xen: stop-iface-ovs-system
I have fixed my code; more details can be found at:
https://progress.opensuse.org/issues/126647
Updated by openqa_review over 1 year ago
- Due date set to 2023-04-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
About 1: I have also updated the description of the service so next time we know who is responsible for it. I think this worked out rather well in the end so I wouldn't have a problem catching this error again to forward it.
About 2 and 4: Likely not worth investigating at this point after just one occurrence.
About 3: I'm looking into how we can ignore it.
Updated by mkittler over 1 year ago
About 1: Unfortunately it has been failing again: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1680522497651&to=1680625869799
Updated by rfan1 over 1 year ago
mkittler wrote:
About 1: Unfortunately it has been failing again: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1680522497651&to=1680625869799
I restarted the service on the xen host today to test my code, using the original script which had the issue:
2023-04-04 03:36:00 systemctl restart stop-iface-ovs-system.service
I think this failed service run is what was captured by the openQA monitoring.
Right now, with the new script, starting/restarting the service returns exit code 0 in both cases, whether the bridge ovs-system is present or not.
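The fixed script itself is not shown in this ticket. The following is only a minimal sketch of how a script could exit with 0 both when the interface is active and when it is absent, loosely reconstructed from the xtrace output quoted earlier. The function names `active_ifaces` and `stop_iface` are hypothetical, and the `awk` match on the second column is a simplification of the original `grep`/`awk` pipeline, assuming the usual "Name State MAC Address" layout of `virsh iface-list --all`.

```shell
#!/bin/bash
# Hypothetical sketch; the real script on openqaw5-xen may look different.

# Print the names of all active libvirt interfaces, one per line.
# Assumes the second column of `virsh iface-list --all` is the state.
active_ifaces() {
    virsh iface-list --all | awk '$2 == "active" { print $1 }'
}

# Destroy the given interface only if it is currently active.
# Returns 0 if the interface is not active or was destroyed successfully;
# a failing `virsh iface-destroy` is still reported as a failure.
stop_iface() {
    local iface="$1"
    if active_ifaces | grep -qx -- "$iface"; then
        virsh iface-destroy "$iface"
    else
        echo "Interface $iface not active, nothing to do"
    fi
}
```

A unit's start command could then end with `stop_iface ovs-system`, so that the service only fails if destroying an active interface actually fails.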
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
About 1: I've mentioned this in the chat and @rfan1 is looking into it.
About 3: I've created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/830 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/517 to exclude it.
Updated by mkittler over 1 year ago
About 3: Looks like the exclusion for the logrotate service has been deployed as expected on pi-worker (and there is no exclusion on other workers).
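The actual exclusion is implemented in the salt merge requests linked above, which are not reproduced here. Purely as an illustration of the idea (ignoring specific units on a specific host), a minimal sketch of such a filter could look like this; the function name `filter_failed_units` and the space-separated exclusion format are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical sketch; not the logic used in salt-states-openqa.

# Read unit names from stdin (one per line) and print those that are not excluded.
#   $1: space-separated list of unit names to ignore on this host
filter_failed_units() {
    local excluded="$1" unit
    while read -r unit; do
        [ -n "$unit" ] || continue
        case " $excluded " in
            *" $unit "*) ;;          # unit is excluded on this host, skip it
            *) echo "$unit" ;;       # unit still counts as failed
        esac
    done
}
```

For example, `systemctl list-units --failed --plain --no-legend | awk '{ print $1 }' | filter_failed_units "logrotate.service"` would list the failed units on a host while leaving out logrotate.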
Updated by okurz over 1 year ago
- Due date deleted (2023-04-18)
- Status changed from Feedback to Resolved
All originally mentioned alerts look good again. There is currently one alert firing, which is to be solved in #127274.
Updated by okurz 6 months ago
- Related to action #158041: grenache needs upgrade to 15.5 added