action #103530

failed systemd services alert - openqaworker-arm-3 - ovsdb-server size:M

Added by cdywan 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-12-06
Due date:
% Done: 0%
Estimated time:

Description

Observation

The alert "failed systemd services alert w/o osd" started firing on Sunday at 5.24 CEST.

2021-12-06 09:36:00        openqaworker-arm-3        ovsdb-server        1

sudo journalctl -u ovsdb-server says:

openqaworker-arm-3 chown[1907]: /usr/bin/chown: cannot access '/run/openvswitch': No such file or directory
openqaworker-arm-3 ovs-ctl[1912]: /lib/lsb/init-functions: line 8: /etc/rc.status: No such file or directory

Acceptance criteria

  • AC1: no services are failing
  • AC2: ovsdb-server on openqaworker-arm-3 is not failing

Suggestions

  • ~Check systemctl status ovsdb-server~
    • journalctl -u
  • Investigate what ovsdb-server is and why it failed
  • Check automatic upgrade e.g. changed config
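
A minimal sketch of the first checks suggested above, assuming a systemd host. The helper name failed_unit_names is hypothetical and not part of any tool; it only filters the output of systemctl list-units, whose columns are UNIT, LOAD, ACTIVE, SUB and DESCRIPTION.

```shell
# Hypothetical helper: print the names of failed units, one per line.
# Expects the output of
#   systemctl list-units --state=failed --plain --no-legend
# on stdin (columns: UNIT LOAD ACTIVE SUB DESCRIPTION).
failed_unit_names() {
    awk 'NF >= 4 && $3 == "failed" { print $1 }'
}

# Usage on the affected worker (not executed here):
#   systemctl list-units --state=failed --plain --no-legend | failed_unit_names
#   sudo journalctl -u ovsdb-server -b --no-pager
```

The filter is kept separate from the systemctl call so it can be tried on saved output from another machine.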

History

#1 Updated by cdywan 6 months ago

  • Subject changed from failed systemd services alert - openqaworker-arm-3 - ovsdb-server to failed systemd services alert - openqaworker-arm-3 - ovsdb-server size:M
  • Description updated (diff)
  • Status changed from New to Workable

#2 Updated by okurz 6 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Also when I log in over ssh as okurz I get

-bash: /etc/profile: No such file or directory
-bash-4.4$

so not the correct prompt. Something more seems to be broken here.

sudo snapper diff 1111..1114 shows that e.g. /etc/bashrc is completely removed. I will try a rollback and clean upgrade.
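
A rough sketch for listing only the files removed between the two snapshots. It uses snapper status instead of the snapper diff call above, on the assumption that each status line starts with a status string whose first character is "-" for a deleted file, followed by a space and the path. The helper name deleted_paths is made up for illustration.

```shell
# Hypothetical helper: print the paths that were deleted between two snapshots.
# Expects `snapper status <from>..<to>` output on stdin, where each line is
# "<status string> <path>" and a leading "-" marks a deleted file.
deleted_paths() {
    awk '/^-/ { sub(/^[^ ]+ /, ""); print }'
}

# Usage on the affected worker (not executed here):
#   sudo snapper status 1111..1114 | deleted_paths
```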

#3 Updated by okurz 6 months ago

# snapper rollback 1111
Ambit is classic.
Creating read-only snapshot of current system. (Snapshot 1115.)
Creating read-write snapshot of snapshot 1111. (Snapshot 1116.)
Setting default subvolume to snapshot 1116.
bash-4.4# reboot

back to a working bash prompt.

I executed

host=openqa.suse.de WORKER=openqaworker-arm-3 failed_since=2021-12-04 bash -ex openqa-advanced-retrigger-jobs

which retriggered a handful of tests, not too many.

Did a zypper dup which brought in some minor updates.

Taking a look into /var/log/zypp/history to see if there is something broken reported there:

2021-12-04 03:01:51|command|root@openqaworker-arm-3|'zypper' '-n' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--force-resolution' '--download-in-advance'|
2021-12-04 03:01:51|install|salt|3002.2-53.4.1|aarch64||repo-sle-update|a5de12a911422a3ca27aa124cc97188b4a7847d256c3a7d0df14efbf3b195891|
2021-12-04 03:02:08|install|python3-salt|3002.2-53.4.1|aarch64||repo-sle-update|6a675324eb22c93f6a084546defabed5976503f06e1cb413d8616ca8f9007a0b|
2021-12-04 03:02:32|install|salt-minion|3002.2-53.4.1|aarch64||repo-sle-update|a0d0ecd602e9e3254e69c0bd94ec423b13fe3704b3e119c3084cce0986a1de42|
2021-12-04 03:02:32|patch  |openSUSE-SLE-15.3-2021-3922|1|noarch|repo-sle-update|moderate|recommended|needed|applied|
2021-12-06 03:00:26|command|root@openqaworker-arm-3|'zypper' '-n' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--force-resolution' '--download-in-advance'|
# 2021-12-06 03:00:31 aaa_base-extras-84.87+git20180409.04c9dae-3.52.1.aarch64.rpm installed ok
# Additional rpm output:
# Updating /etc/sysconfig/backup ...
# 
2021-12-06 03:00:31|install|aaa_base-extras|84.87+git20180409.04c9dae-3.52.1|aarch64||repo-sle-update|9b78a0144c68c730b96e341b7c57e6397bf905e47050a6b89b67e141173b37a5|
2021-12-06 03:00:32|install|keyutils|1.6.3-5.6.1|aarch64||repo-sle-update|f2ed82f8419bc63161bf91d7af39b84675ed5ae308b801077da2c48bb435a6c8|
2021-12-06 03:00:32|install|release-notes-sles|15.3.20211201-3.17.1|noarch||repo-sle-update|fa3c1ec35f2e768b503d4cc25768e417b39990266ea46fb4bd8f2829fa222546|
2021-12-06 03:00:32|patch  |openSUSE-SLE-15.3-2021-3899|1|noarch|repo-sle-update|moderate|security|needed|applied|
2021-12-06 03:00:32|patch  |openSUSE-SLE-15.3-2021-3891|1|noarch|repo-sle-update|moderate|recommended|needed|applied|
2021-12-06 03:00:32|patch  |openSUSE-SLE-15.3-2021-3896|1|noarch|repo-sle-update|low|recommended|needed|applied|
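
To narrow a history like the one above down to a single day, something along these lines might help. It is only a sketch: the function name zypp_installs_on is hypothetical, and it assumes the pipe-separated layout seen in the excerpt (timestamp, action, package name, version, ...).

```shell
# Hypothetical helper: print "name version" for every package installed
# on a given day.
#   $1: date as YYYY-MM-DD
#   $2: history file (defaults to /var/log/zypp/history)
zypp_installs_on() {
    awk -F'|' -v day="$1" '
        /^#/ { next }    # skip rpm script output embedded as comment lines
        $2 == "install" && index($1, day) == 1 { print $3, $4 }
    ' "${2:-/var/log/zypp/history}"
}

# Usage on the affected worker (not executed here):
#   zypp_installs_on 2021-12-06
```

Lines with other actions such as "command" or "patch" are ignored because only an exact "install" in the second field matches.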

Nothing obvious in the zypper history. Now two services failed: logrotate and snapper-cleanup.

logrotate failed because the rotate target files already existed for the same day, obviously because those archives had already been created before the rotate. This could be fixed by a simple restart with systemctl start logrotate, after which systemctl status logrotate shows green status again.

snapper-cleanup fails, and snapper -v cleanup number shows that it fails to delete snapshot nr. 1. This looks very related to #102942

btrfs subvolume list / | grep btrfs shows

ID 1207 gen 2205973 top level 259 path @/.snapshots/1/snapshot/var/lib/containers/storage/btrfs/subvolumes/f140b3cbd9d1062ecafbd05115442711aaa23b9c576e57a11a8850a37ac875f2

so I did

openqaworker-arm-3:/home/okurz # btrfs subvolume delete /.snapshots/1/snapshot/var/lib/containers/storage/btrfs/subvolumes/f140b3cbd9d1062ecafbd05115442711aaa23b9c576e57a11a8850a37ac875f2
Delete subvolume (no-commit): '/.snapshots/1/snapshot/var/lib/containers/storage/btrfs/subvolumes/f140b3cbd9d1062ecafbd05115442711aaa23b9c576e57a11a8850a37ac875f2'
openqaworker-arm-3:/home/okurz # snapper delete 1
openqaworker-arm-3:/home/okurz # systemctl start snapper-cleanup
openqaworker-arm-3:/home/okurz # systemctl status snapper-cleanup
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
     Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static)
     Active: active (running) since Mon 2021-12-06 12:06:15 CET; 5s ago
TriggeredBy: ● snapper-cleanup.timer
       Docs: man:snapper(8)
             man:snapper-configs(5)
   Main PID: 12908 (systemd-helper)
      Tasks: 1 (limit: 14745)
     CGroup: /system.slice/snapper-cleanup.service
             └─12908 /usr/lib/snapper/systemd-helper --cleanup

Dec 06 12:06:15 openqaworker-arm-3 systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Dec 06 12:06:17 openqaworker-arm-3 systemd-helper[12908]: running cleanup for 'root'.
Dec 06 12:06:17 openqaworker-arm-3 systemd-helper[12908]: running number cleanup for 'root'.

and all good again.
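
The lookup for subvolumes nested inside a snapshot, done above with grep, could be sketched more specifically as follows. The function name nested_subvolumes is hypothetical. It assumes the "@/.snapshots/<nr>/snapshot" layout seen in the output above and paths without spaces.

```shell
# Hypothetical helper: print subvolume paths nested inside snapshot number $1.
# Expects `btrfs subvolume list /` output on stdin, where the last field of
# each line is the subvolume path. Paths containing spaces are not handled.
nested_subvolumes() {
    awk -v prefix="@/.snapshots/$1/snapshot/" '
        index($NF, prefix) == 1 { print $NF }
    '
}

# Usage on the affected worker (not executed here):
#   btrfs subvolume list / | nested_subvolumes 1
# Each reported path would then need a manual `btrfs subvolume delete`
# (in the steps above the leading "@" was dropped to get the mounted path)
# before `snapper delete 1` can succeed.
```

The snapshot's own subvolume is not reported because the prefix ends with a slash, so only subvolumes below it match.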

#4 Updated by okurz 6 months ago

  • Status changed from In Progress to Resolved

openqaworker-arm-3 is happily working on jobs, e.g. https://openqa.suse.de/tests/7791013

https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&editPanel=6&tab=alert shows no failed systemd services so all good
