action #127754


osd nfs-server needed to be restarted but we got no alerts size:M

Added by tinita over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

See #121573 and https://suse.slack.com/archives/C02CANHLANP/p1681228800740289

s390zp18:/var/lib/openqa/share/factory # cd hdd/fixed/
-bash: cd: hdd/fixed/: Stale file handle

Suggestions

  • Research whether "Stale file handle" errors for NFS can be prevented or handled better; maybe we need to upgrade all machines to a newer OS? s390zp18 is SLE12SP5 (and has a long uptime, so likely no automatic upgrades)
  • Research monitoring and alerting options for NFS mounts or handles
  • Try to reproduce the problem, e.g. with s390zp18 and OSD; maybe it has to do with reboots of machines?
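The monitoring suggestion could start from a simple probe. A minimal sketch, assuming GNU coreutils `timeout` and `stat` and a hypothetical mount list (not the deployed solution):

```shell
#!/bin/sh
# Hypothetical probe: stat on a stale NFS mount fails with ESTALE or
# hangs, so wrapping it in "timeout" distinguishes healthy from broken
# mounts without blocking the caller indefinitely.
check_mount() {
    timeout 10 stat -t "$1" >/dev/null 2>&1
}

# MOUNTS is an assumed list; replace it with the real NFS mount points.
MOUNTS="/var/lib/openqa/share"
for m in $MOUNTS; do
    if check_mount "$m"; then
        echo "OK $m"
    else
        echo "STALE-OR-DOWN $m"
    fi
done
```

A wrapper like this could feed a monitoring check or alert, e.g. by exiting non-zero when any mount is unhealthy.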

Related issues 3 (2 open, 1 closed)

Related to openQA Project - action #65450: workers on o3 power did not restart after upgrade as NFS mount point was stale "Ignoring host 'http://openqa1-opensuse': Working directory does not exist" (Workable, 2020-04-08)

Related to openQA Infrastructure - action #51836: Manage (parts) of s390 kvm instances (formerly s390p7 and s390p8) with salt (Resolved, okurz, 2019-05-22)

Copied from openQA Project - action #121573: Asset/HDD goes missing while job is running (New, 2022-12-06)

Actions #1

Updated by tinita over 1 year ago

  • Copied from action #121573: Asset/HDD goes missing while job is running added
Actions #2

Updated by tinita over 1 year ago

  • Subject changed from osd nfs-server needed to be restarted butwe got no alerts to osd nfs-server needed to be restarted but we got no alerts
Actions #3

Updated by okurz over 1 year ago

  • Target version set to Ready
Actions #4

Updated by okurz over 1 year ago

  • Related to action #65450: workers on o3 power did not restart after upgrade as NFS mount point was stale "Ignoring host 'http://openqa1-opensuse': Working directory does not exist" added
Actions #5

Updated by okurz over 1 year ago

  • Related to action #51836: Manage (parts) of s390 kvm instances (formerly s390p7 and s390p8) with salt added
Actions #6

Updated by okurz over 1 year ago

  • Subject changed from osd nfs-server needed to be restarted but we got no alerts to osd nfs-server needed to be restarted but we got no alerts size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by okurz over 1 year ago

  • Assignee set to okurz

Researching some solutions for detecting stale NFS handles.

Actions #9

Updated by nicksinger over 1 year ago

  • Assignee set to nicksinger
Actions #10

Updated by nicksinger over 1 year ago

I tried to understand the problem a bit more but didn't come to a full understanding yet. I found another useful tool: https://github.com/acdha/mountstatus - unfortunately it would need to be packaged before we can make use of it.
Second best seems to be https://gist.github.com/cinsk/840ed553905cb6e8f0ae, although I dislike writing a tmp file just to store a PID. Maybe this can be rewritten somehow.

Actions #11

Updated by okurz over 1 year ago

WDYT about a solution based on https://engineerworkshop.com/blog/automatically-resolve-nfs-stale-file-handle-errors-in-ubuntu-linux/amp/

#!/bin/sh
# Find mount points that df reports as stale, then lazy-unmount and remount them.
list=$(df 2>&1 | grep 'Stale file handle' | awk '{print $2}' | tr -d ':')
for directory in $list
do
    umount -l "$directory"
    mount -a
done

So far I don't see any flaws in it, and it could be the easiest solution. One could of course replace grep+awk+tr with a single sed call, but that shouldn't matter much.
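For reference, the sed variant could look like this (a sketch, assuming df reports lines of the form `df: /path: Stale file handle`, the same format the awk pipeline relies on):

```shell
# One sed call replacing grep+awk+tr: print only the mount point from
# lines like "df: /path: Stale file handle"; all other lines are dropped.
list=$(df 2>&1 | sed -n 's/^df: \(.*\): Stale file handle$/\1/p')
```

Note that some df versions quote the path in their error message, in which case the pattern would need adjusting.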

Actions #12

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Feedback

The script was deployed with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/874 and later fixed with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/875. I checked that the file and the timer are present on all workers:

openqa:~ # salt -C 'G@roles:worker' cmd.run 'ls -lah /usr/local/bin/recover-nfs.sh'
worker3.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker2.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker6.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker18.qa.suse.cz:
    -rwxr--r-- 1 root root 155 Jun  6 00:46 /usr/local/bin/recover-nfs.sh
openqaworker16.qa.suse.cz:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker5.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker8.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker17.qa.suse.cz:
    -rwxr--r-- 1 root root 155 Jun  6 00:46 /usr/local/bin/recover-nfs.sh
worker9.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker1.qe.nue2.suse.org:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker14.qa.suse.cz:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
powerqaworker-qam-1.qa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
QA-Power8-5-kvm.qa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker11.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker13.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
QA-Power8-4-kvm.qa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
malbec.arch.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker10.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
worker12.oqa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
grenache-1.qa.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker-arm-1.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:49 /usr/local/bin/recover-nfs.sh
openqaworker-arm-2.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:49 /usr/local/bin/recover-nfs.sh
openqaworker-arm-3.suse.de:
    -rwxr--r-- 1 root root 155 Jun  6 00:49 /usr/local/bin/recover-nfs.sh
openqa:~ # salt -C 'G@roles:worker' cmd.run 'systemctl status recover-nfs'
openqaworker16.qa.suse.cz:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker18.qa.suse.cz:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker9.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker17.qa.suse.cz:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker8.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker14.qa.suse.cz:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
malbec.arch.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker2.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker3.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker12.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker11.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
grenache-1.qa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker13.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker10.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker6.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
QA-Power8-5-kvm.qa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
worker5.oqa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker1.qe.nue2.suse.org:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
QA-Power8-4-kvm.qa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
powerqaworker-qam-1.qa.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker-arm-2.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker-arm-1.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
openqaworker-arm-3.suse.de:
    * recover-nfs.service - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
         Active: inactive (dead)
ERROR: Minions returned with non-zero exit code
openqa:~ # salt -C 'G@roles:worker' cmd.run 'systemctl status recover-nfs.timer'
openqaworker16.qa.suse.cz:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker18.qa.suse.cz:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker17.qa.suse.cz:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker2.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker3.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker8.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker9.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker6.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker1.qe.nue2.suse.org:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
powerqaworker-qam-1.qa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker14.qa.suse.cz:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
QA-Power8-5-kvm.qa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker13.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
QA-Power8-4-kvm.qa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker11.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker5.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker12.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
worker10.oqa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
malbec.arch.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
grenache-1.qa.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker-arm-1.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker-arm-2.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
openqaworker-arm-3.suse.de:
    * recover-nfs.timer - Automatically recover stall nfs shares.
         Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
         Active: inactive (dead)
        Trigger: n/a
       Triggers: * recover-nfs.service
ERROR: Minions returned with non-zero exit code

To enable the timer I created: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/876
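The unit pair behind this timer could look roughly like the following sketch, reconstructed from the status output above; the authoritative definitions are in the salt-states-openqa merge requests, and the 10-minute interval is an assumption:

```ini
# /etc/systemd/system/recover-nfs.service (sketch)
[Unit]
Description=Automatically recover stale NFS shares

[Service]
Type=oneshot
ExecStart=/usr/local/bin/recover-nfs.sh

# /etc/systemd/system/recover-nfs.timer (sketch)
[Unit]
Description=Automatically recover stale NFS shares

[Timer]
# Assumed interval; the real timer may use a different schedule.
OnCalendar=*:0/10

[Install]
WantedBy=timers.target
```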

Actions #13

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

Fixed another typo in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/877 and added a suggestion from @mkittler to make the unit depend on the mount point where the script is stored. I verified with salt -C 'G@roles:worker' cmd.run 'systemctl status recover-nfs.timer' that the timer is active, and we don't see any "Failed systemd services" alerts, so I think we can conclude here for now.
