action #168358: [alert] Failed systemd services alert: var-lib-openqa-share-factory*.mount - openQA Infrastructure (public) - openSUSE Project Management Tool

action #168358

closed

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

[alert] Failed systemd services alert: var-lib-openqa-share-factory*.mount

Added by tinita 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2024-10-17
Due date:
% Done:
Estimated time:
Tags: nfs, osd, infra, reactive work

Description

Observation

Date: Thu, 17 Oct 2024 11:00:40 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/UzAhcmBVk/view?orgId=1
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

2024-10-17 09:07:40	s390zl12	var-lib-openqa-share-factory.mount	1
2024-10-17 09:07:40	openqa	var-lib-openqa-share-factory-hdd-fixed.automount, var-lib-openqa-share-factory-iso-fixed.automount	2

Related issues 1 (0 open — 1 closed)

Updated by okurz 7 months ago

Tags set to reactive work, infra, osd, nfs
Assignee set to nicksinger
Parent task set to #157969

Updated by nicksinger 7 months ago

Status changed from New to Resolved

While debugging a failure reported in https://progress.opensuse.org/issues/157981 I tried to restart our NFS server with systemctl restart nfs-server. This snowballed out of control: NFS failed to start and only printed "[nfs] failed to connect to openqa.suse.de" over and over again in dmesg. I have no idea where that message comes from; I checked /etc/fstab on OSD and it does not mention OSD itself.
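
For reference, the /etc/fstab check mentioned above could be scripted roughly as follows. This is only a sketch: the function name, the sample fstab entries and the second host name are made up for illustration and are not taken from OSD.

```python
# Sketch: find NFS entries in an fstab whose server is the local machine.
# SAMPLE_FSTAB is hypothetical data, not the real /etc/fstab of OSD.

NFS_TYPES = {"nfs", "nfs4"}

SAMPLE_FSTAB = """\
# hypothetical entries for illustration only
/dev/vda1 / ext4 defaults 0 1
openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs ro 0 0
storage.example.org:/export /mnt/storage nfs4 defaults 0 0
"""


def self_referencing_nfs_mounts(fstab_text, own_names):
    """Return (source, mountpoint) pairs for NFS entries served by own_names."""
    hits = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        source, mountpoint, fstype = fields[:3]
        if fstype not in NFS_TYPES:
            continue
        # NFS sources look like "server:/exported/path".
        server, sep, _path = source.rpartition(":")
        if not sep:
            continue
        if server.strip("[]") in own_names:
            hits.append((source, mountpoint))
    return hits


result = self_referencing_nfs_mounts(SAMPLE_FSTAB, {"openqa.suse.de"})
```

On a real host the text would come from reading /etc/fstab and the names from something like socket.gethostname() and socket.getfqdn(). An empty result only rules out fstab; mounts defined elsewhere (for example in systemd unit files or automounter maps) are not covered by this check.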

After everything eventually hung I decided to reboot OSD, which did not work because systemd itself was also stuck in I/O wait. So I had to use magic SysRq keys, but I mixed them up and issued "i", which killed my last remaining SSH session. I quickly asked Gerhard for help and he was able to forcefully restart the VM. After it booted everything seems fine again. @okurz already took care of restarting the jobs that failed in that period.
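
To make the SysRq mix-up easier to avoid next time, here is a small sketch that maps the commonly used keys to their actions and flags the ones that kill user-space processes (and with them any SSH session). The helper names and the "sub" sequence for remote use are assumptions for illustration; the sketch only describes the keys and does not write to /proc/sysrq-trigger itself.

```python
# Sketch: describe magic SysRq keys before sending them.
# Key meanings follow the Linux kernel SysRq documentation.

SYSRQ_ACTIONS = {
    "r": "switch the keyboard from raw mode to XLATE",
    "e": "send SIGTERM to all processes except init",
    "i": "send SIGKILL to all processes except init",
    "s": "sync all mounted filesystems",
    "u": "remount all filesystems read-only",
    "b": "reboot immediately without syncing or unmounting",
}

# Keys that terminate user-space processes, including sshd and login shells.
KILLS_USERSPACE = {"e", "i"}

# Assumption: over SSH one would skip r/e/i and only sync, remount, reboot.
REMOTE_REBOOT_KEYS = "sub"


def describe(keys):
    """Return a list of (key, action, kills_sessions) for a key sequence."""
    plan = []
    for key in keys:
        if key not in SYSRQ_ACTIONS:
            raise ValueError(f"unknown SysRq key: {key!r}")
        plan.append((key, SYSRQ_ACTIONS[key], key in KILLS_USERSPACE))
    return plan


for key, action, kills in describe(REMOTE_REBOOT_KEYS):
    warning = " (kills remote sessions!)" if kills else ""
    print(f"{key}: {action}{warning}")
```

Each key would then be sent as root, for example by writing the single character to /proc/sysrq-trigger. Whether sync and remount complete on a host stuck in I/O wait is not guaranteed, so the final "b" may still lose unwritten data.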

Updated by jbaier_cz 7 months ago

Related to action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify added
