action #168358
closed
openQA Project - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
[alert] Failed systemd services alert: var-lib-openqa-share-factory*.mount
0%
Description
Observation
Date: Thu, 17 Oct 2024 11:00:40 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/UzAhcmBVk/view?orgId=1
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1
2024-10-17 09:07:40 s390zl12 var-lib-openqa-share-factory.mount 1
2024-10-17 09:07:40 openqa var-lib-openqa-share-factory-hdd-fixed.automount, var-lib-openqa-share-factory-iso-fixed.automount 2
Updated by okurz about 1 month ago
- Tags set to reactive work, infra, osd, nfs
- Assignee set to nicksinger
- Parent task set to #157969
Updated by nicksinger about 1 month ago
- Status changed from New to Resolved
While debugging a failure reported in https://progress.opensuse.org/issues/157981 I tried to restart our NFS server with systemctl restart nfs-server. This snowballed out of control, with NFS failing to start and dmesg only showing "[nfs] failed to connect to openqa.suse.de" over and over again. I have no clue where that message comes from; I checked /etc/fstab on OSD and it does not mention itself.
After everything eventually hung I decided to reboot OSD, which did not work because systemd itself was also stuck in I/O wait. So I had to use magic SysRq, where I mixed up the keys and issued "i", which killed my last remaining SSH session. I quickly asked Gerhard for help and he was able to forcefully restart the VM. After it booted, everything seems good again. @okurz already took care of restarting the jobs that failed in that period.
Updated by jbaier_cz about 1 month ago
- Related to action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify added