action #168358

closed

openQA Project - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

[alert] Failed systemd services alert: var-lib-openqa-share-factory*.mount

Added by tinita 10 days ago. Updated 10 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-10-17
Due date:
% Done:

0%

Estimated time:

Description

Observation

Date: Thu, 17 Oct 2024 11:00:40 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/UzAhcmBVk/view?orgId=1
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

2024-10-17 09:07:40  s390zl12  var-lib-openqa-share-factory.mount  1
2024-10-17 09:07:40  openqa    var-lib-openqa-share-factory-hdd-fixed.automount, var-lib-openqa-share-factory-iso-fixed.automount  2

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify (Resolved, jbaier_cz)

Actions #1

Updated by okurz 10 days ago

  • Tags set to reactive work, infra, osd, nfs
  • Assignee set to nicksinger
  • Parent task set to #157969
Actions #2

Updated by nicksinger 10 days ago

  • Status changed from New to Resolved

While debugging a failure reported in https://progress.opensuse.org/issues/157981 I tried to restart our NFS server with systemctl restart nfs-server. This snowballed out of control: NFS failed to start and dmesg only printed "[nfs] failed to connect to openqa.suse.de" over and over again. I have no clue where that message comes from. I checked /etc/fstab on OSD and it does not mention OSD itself.
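
The /etc/fstab check mentioned above could be scripted roughly as follows. This is only a sketch: the function name nfs_entries_mentioning is hypothetical, and it looks at fstab alone, whereas mounts can also be defined by systemd .mount/.automount units or autofs, which it does not inspect.

```python
def nfs_entries_mentioning(fstab_text, host):
    """Return fstab lines with an NFS filesystem type whose source references host.

    Hypothetical helper for illustration; only fstab content is inspected.
    """
    hits = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        source, _mountpoint, fstype = fields[:3]
        # Matches "nfs" and "nfs4" entries whose source contains the host name.
        if fstype.startswith("nfs") and host in source:
            hits.append(line)
    return hits
```

On the host itself it could be called as nfs_entries_mentioning(open("/etc/fstab").read(), "openqa.suse.de"); an empty list matches the observation that fstab does not mention the host.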

Eventually everything hung, so I decided to reboot OSD. That did not work because systemd itself was also stuck in I/O wait. So I had to use magic SysRq keys, which I mixed up: I issued "i", which killed my last remaining ssh session. I quickly asked Gerhard for help and he was able to forcefully restart the VM. After it booted, everything seems fine again. @okurz already took care of restarting the jobs that failed in that period.
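
For reference, a small sketch of what the relevant magic SysRq keys do, with key meanings taken from the Linux kernel documentation. The helper names sysrq_plan and trigger are hypothetical, and sending keys by writing to /proc/sysrq-trigger is an assumption; the comment above does not say how the keys were issued. It defaults to a dry run and makes no claim about which sequence would have worked in this incident.

```python
# Magic SysRq command keys as described in the Linux kernel admin guide.
SYSRQ_KEYS = {
    "s": "sync all mounted filesystems",
    "u": "remount all filesystems read-only",
    "b": "reboot immediately, without syncing or unmounting",
    "e": "send SIGTERM to all processes except init",
    "i": "send SIGKILL to all processes except init (kills ssh sessions too)",
}

def sysrq_plan(keys):
    """Return (key, description) pairs for a key sequence, rejecting unknown keys."""
    unknown = [k for k in keys if k not in SYSRQ_KEYS]
    if unknown:
        raise ValueError(f"unknown SysRq keys: {unknown}")
    return [(k, SYSRQ_KEYS[k]) for k in keys]

def trigger(keys, dry_run=True, path="/proc/sysrq-trigger"):
    """Print the plan; write each key to the trigger file only if dry_run is False.

    Writing to /proc/sysrq-trigger requires root and takes effect immediately.
    """
    for key, what in sysrq_plan(keys):
        print(f"{key}: {what}")
        if not dry_run:
            with open(path, "w") as f:
                f.write(key)
```

Calling trigger("sub") only prints the sync, remount read-only, reboot plan; seeing the description of "i" before sending it shows why the ssh session was lost.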

Actions #3

Updated by jbaier_cz 6 days ago

  • Related to action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify added