action #127097

closed

[alert] Failed systemd services alert

Added by mkittler about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-04-03
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Affected hosts/services:

  1. openqaw5-xen: stop-iface-ovs-system
  2. baremetal-support: openqa-scheduler, openqa-webui
  3. openqa-piworker: logrotate
  4. worker11: var-lib-openqa-share.mount

These are all distinct issues. I'll possibly create separate tickets if they take too long to fix.


Related issues 2 (0 open, 2 closed)

Related to openQA Tests - action #126647: [qe-core] test fails in bootloader_start - we should use br0 not ovs-system (Resolved, rfan1, 2023-03-27)
Related to openQA Infrastructure - action #158041: grenache needs upgrade to 15.5 (Resolved, okurz, 2024-03-26, 2024-04-09)

Actions #1

Updated by mkittler about 1 year ago

About 1: This service is deployed via /etc/systemd/system/stop-iface-ovs-system.service which does not belong to a package and lacks a description. Restarting the service worked. However, it is not clear what the purpose of the service is. The error was:

martchus@openqaw5-xen:~> sudo journalctl -fu stop-iface-ovs-system.service
Apr 02 03:34:02 openqaw5-xen bash[2392]: + virsh iface-list --all
Apr 02 03:34:02 openqaw5-xen bash[2393]: + grep -w active
Apr 02 03:34:02 openqaw5-xen bash[2395]: + grep ovs-system
Apr 02 03:34:02 openqaw5-xen bash[2394]: + awk '{ print $1 }'
Apr 02 03:34:10 openqaw5-xen bash[2395]: ovs-system
Apr 02 03:34:10 openqaw5-xen bash[2388]: + '[' 0 -eq 0 ']'
Apr 02 03:34:10 openqaw5-xen bash[2388]: + virsh iface-destroy ovs-system
Apr 02 03:34:10 openqaw5-xen bash[3352]: Interface ovs-system destroyed
Apr 02 03:34:10 openqaw5-xen systemd[1]: stop-iface-ovs-system.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 03:34:10 openqaw5-xen systemd[1]: stop-iface-ovs-system.service: Failed with result 'exit-code'.

For now I have posted a message about the problem in #eng-testing. That must be good enough, considering it is really hard to tell where the service/script comes from, and I don't think reverse engineering this setup is worth the effort.


About 2: The web UI services worked again after restarting. The error was:

-- Boot 1eeaff687b724b6fb7be596956c3ee46 --
Apr 02 03:37:18 baremetal-support systemd[1]: Started The openQA web UI.
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: Error when trying to get the database version: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]:         Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/VersionStorage/Standard.pm at line 26) line 18
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]: DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=openqa','',...) failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Apr 02 03:39:39 baremetal-support openqa-webui-daemon[2555]:         Is the server running locally and accepting connections on that socket? at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/Storage/DBI.pm line 1517. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
Apr 02 03:39:39 baremetal-support systemd[1]: openqa-webui.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 02 03:39:39 baremetal-support systemd[1]: openqa-webui.service: Failed with result 'exit-code'.

So for some reason PostgreSQL had not been started yet when the web UI was started. We have #125903 and #122578, but those are about DB shutdowns and thus not really related. openQA-local-db is installed on the host, the dependencies at systemd level seem correct, and PostgreSQL is linked against libsystemd.so.0 and thus should signal its readiness correctly. So I'm out of ideas as to what went wrong there.
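
To double-check the two assumptions above (ordering and readiness signalling), something along these lines could be used. This is only a sketch: the helper names and the SYSTEMCTL override are made up for illustration, and the unit names in the usage note are assumed rather than taken from the host.

```shell
#!/bin/bash
# Sketch only. SYSTEMCTL is a hypothetical override so the helpers can be
# exercised without a running systemd; it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Succeeds if unit $1 lists unit $2 among its After= ordering dependencies.
is_ordered_after() {
    "$SYSTEMCTL" show --property=After --value "$1" | tr ' ' '\n' | grep -qxF "$2"
}

# Prints the service type of unit $1. With "notify", systemd waits for the
# service to signal readiness before starting units ordered after it.
service_type() {
    "$SYSTEMCTL" show --property=Type --value "$1"
}
```

For example, `is_ordered_after openqa-webui.service postgresql.service && service_type postgresql.service` would confirm the ordering and print the type, assuming those are the unit names in use.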

Actions #2

Updated by mkittler about 1 year ago

  • Tags set to alert, infra
Actions #3

Updated by mkittler about 1 year ago

About 3: The log of logrotate is empty once again. I suppose we best just ignore this recurring issue as it eventually fixes itself anyway.


About 4: This was a DNS problem:

Apr 02 03:36:02 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Apr 02 03:36:02 worker11 mount[9438]: mount.nfs: Failed to resolve server openqa.suse.de: Name or service not known
Apr 02 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited, status=32/n/a
Apr 02 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Failed with result 'exit-code'.
Apr 02 03:36:02 worker11 systemd[1]: Failed to mount /var/lib/openqa/share.

Considering it fixed itself, this is something we may also just want to ignore somehow. Supposedly it was not a problem with the DNS server; rather, the NFS mount was attempted before DNS had been set up on the host. (We already have noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m in /etc/fstab, but the nofail option and the retries are apparently not enough to prevent the unit from failing entirely.)
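
If we ever wanted to work around this instead of ignoring it, one option would be to wait until the server name resolves before mounting. A minimal sketch follows; the function names, the default attempt count and the default delay are made up for illustration, and nothing like this is deployed.

```shell
#!/bin/bash
# Sketch only: wait for name resolution before attempting the NFS mount.

# Hypothetical helper: succeeds once the host name $1 resolves.
resolve_host() {
    getent hosts "$1" > /dev/null
}

# Retry resolving $1 up to $2 times, sleeping $3 seconds between attempts.
# Returns 0 as soon as the name resolves, 1 if all attempts fail.
wait_for_dns() {
    local host="$1" attempts="${2:-30}" delay="${3:-2}" i=1
    while ! resolve_host "$host"; do
        [ "$i" -ge "$attempts" ] && return 1
        sleep "$delay"
        i=$((i + 1))
    done
    return 0
}
```

Usage would be something like `wait_for_dns openqa.suse.de 30 2 && mount /var/lib/openqa/share`. Whether such a wrapper is preferable to ordering the mount after a suitable systemd target is an open question.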

Actions #4

Updated by okurz about 1 year ago

  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #5

Updated by rfan1 about 1 year ago

  • Related to action #126647: [qe-core] test fails in bootloader_start - we should use br0 not ovs-system added
Actions #6

Updated by rfan1 about 1 year ago

For item 1: openqaw5-xen: stop-iface-ovs-system

I have fixed my code; more details can be found at:
https://progress.opensuse.org/issues/126647

Actions #7

Updated by openqa_review about 1 year ago

  • Due date set to 2023-04-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler about 1 year ago

About 1: I have also updated the description of the service so next time we know who is responsible for it. I think this worked out rather well in the end so I wouldn't have a problem catching this error again to forward it.

About 2 and 4: Likely not worth investigating at this point after just one occurrence.

About 3: I'm looking into how we can ignore it.

Actions #10

Updated by rfan1 about 1 year ago

mkittler wrote:

About 1: Unfortunately it has been failing again: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1680522497651&to=1680625869799

I restarted the service on the xen host today to test my code, using the original script which had the issue:
2023-04-04 03:36:00 systemctl restart stop-iface-ovs-system.service
I think the failed service was captured by the openQA monitor.

Right now, with the new script, starting or restarting the service returns exit code 0 whether or not the bridge ovs-system is present.
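
For illustration, a script with that behaviour could look roughly like the following. This is not the script deployed on openqaw5-xen, only a sketch based on the commands visible in the journal above; the function name and the VIRSH override are made up.

```shell
#!/bin/bash
# Sketch only: not the actual script deployed on openqaw5-xen.
# VIRSH is a hypothetical override so the logic can be tried without libvirt.
VIRSH="${VIRSH:-virsh}"

stop_iface() {
    local iface="$1"
    # Look for an exact match on the interface name with state "active".
    if "$VIRSH" iface-list --all |
        awk -v name="$iface" '$1 == name && $2 == "active" { found = 1 } END { exit !found }'; then
        # Interface is active: destroy it, but only warn if that fails.
        "$VIRSH" iface-destroy "$iface" || echo "warning: could not destroy $iface" >&2
    else
        echo "$iface is not active, nothing to do"
    fi
    # Always report success so the systemd unit does not end up failed.
    return 0
}
```

Calling `stop_iface ovs-system` then exits with 0 in both cases. Always returning 0 trades visibility of real failures for a quiet alert, so the warning on stderr is kept to leave a trace in the journal.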

Actions #11

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback

About 1: I've mentioned this in the chat and @rfan1 is looking into it.

About 3: I've created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/830 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/517 to exclude it.

Actions #12

Updated by mkittler about 1 year ago

About 3: Looks like the exclusion for the logrotate service has been deployed as expected on pi-worker (and there is no exclusion on other workers).

Actions #13

Updated by okurz about 1 year ago

  • Due date deleted (2023-04-18)
  • Status changed from Feedback to Resolved

All originally mentioned alerts look good again. There is currently 1 alert firing, which is to be solved in #127274.

Actions #14

Updated by okurz about 1 month ago
