action #127754
closed
osd nfs-server needed to be restarted but we got no alerts size:M
0%
Description
Observation
See #121573 and https://suse.slack.com/archives/C02CANHLANP/p1681228800740289
s390zp18:/var/lib/openqa/share/factory # cd hdd/fixed/
-bash: cd: hdd/fixed/: Stale file handle
Suggestions
- Research whether "Stale file handle" errors for NFS can be prevented or handled better. Maybe all machines need to be upgraded to a newer OS? s390zp18 is SLE12SP5 (with a long uptime and likely no automatic upgrades)
- Research monitoring and alerting for NFS mounts or handles
- Try to reproduce the problem, e.g. with s390zp18 and OSD. Maybe it has to do with reboots of machines?
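For the monitoring suggestion, here is a minimal sketch of how stale or hung NFS mounts could be detected, assuming GNU coreutils (timeout, stat) are available on the workers. The function names, the 5 second default and the message text are made up for illustration and are not part of any existing tooling.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical.

# Print the mount point (field 2) of every nfs/nfs4 entry.
# Expects /proc/mounts-formatted lines on stdin.
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }'
}

# Return 0 if the path can be stat'ed within the time limit (default 5 s),
# non-zero if stat fails (e.g. stale handle) or the probe times out.
mount_is_healthy() {
    timeout "${2:-5}" stat -t -- "$1" >/dev/null 2>&1
}

# Print every NFS mount point that fails the probe; return 1 if any failed.
# The optional argument is the mounts file to read (default /proc/mounts).
report_unhealthy_nfs() {
    found=0
    for mountpoint in $(list_nfs_mounts < "${1:-/proc/mounts}"); do
        if ! mount_is_healthy "$mountpoint"; then
            echo "unhealthy NFS mount: $mountpoint"
            found=1
        fi
    done
    return "$found"
}
```

Calling report_unhealthy_nfs from a periodic check and alerting on a non-zero return would be one way to get notified. Whether stat on the mount point reliably triggers the error on all affected kernels would still need to be verified.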
Updated by tinita over 1 year ago
- Copied from action #121573: Asset/HDD goes missing while job is running added
Updated by tinita over 1 year ago
- Subject changed from osd nfs-server needed to be restarted butwe got no alerts to osd nfs-server needed to be restarted but we got no alerts
Updated by okurz over 1 year ago
- Related to action #65450: workers on o3 power did not restart after upgrade as NFS mount point was stale "Ignoring host 'http://openqa1-opensuse': Working directory does not exist" added
Updated by okurz over 1 year ago
- Related to action #51836: Manage (parts) of s390 kvm instances (formerly s390p7 and s390p8) with salt added
Updated by okurz over 1 year ago
- Subject changed from osd nfs-server needed to be restarted but we got no alerts to osd nfs-server needed to be restarted but we got no alerts size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Assignee set to okurz
Researching some solutions for detecting the situations of stale NFS handles
Updated by okurz over 1 year ago
- Assignee deleted (okurz)
Some ideas:
- https://engineerworkshop.com/blog/automatically-resolve-nfs-stale-file-handle-errors-in-ubuntu-linux/amp/
- https://support.delphix.com/Continuous_Data_Engine_(formerly_Virtualization_Engine)/Delphix_Admin/Resolving_%22Stale_File_Handle%22_Error_on_Linux_Systems_(KBA1037)
- https://stackoverflow.com/questions/1643347/is-there-a-good-way-to-detect-a-stale-nfs-mount
- https://gist.github.com/cinsk/840ed553905cb6e8f0ae
Updated by nicksinger over 1 year ago
I tried to understand the problem a bit more but haven't come to a full understanding yet. I found another useful tool here: https://github.com/acdha/mountstatus - unfortunately it would need to be packaged before we can make use of it.
The second-best option seems to be https://gist.github.com/cinsk/840ed553905cb6e8f0ae, although I dislike writing a tmp file just to store a PID. Maybe this can be rewritten somehow.
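As a rough idea of how a probe with a deadline could work without a temporary file, here is a sketch that keeps the PIDs in shell variables instead. This is not a rewrite of the linked gist; the function name probe_with_deadline and the 5 second default are assumptions for illustration. Where coreutils is available, `timeout 5 stat "$path"` achieves much the same with less code.

```shell
#!/bin/sh
# Sketch only: probe a path with stat, kill the probe after a deadline.
# PIDs live in shell variables, so no temporary file is needed.
probe_with_deadline() {
    path=$1
    deadline=${2:-5}

    # Start the probe in the background and remember its PID.
    stat -- "$path" >/dev/null 2>&1 &
    probe_pid=$!

    # Watchdog: kill the probe if it is still running after the deadline.
    (
        sleep "$deadline" &
        sleep_pid=$!
        # If the watchdog is cancelled, clean up its sleep as well.
        trap 'kill "$sleep_pid" 2>/dev/null; exit 0' TERM
        wait "$sleep_pid" && kill -9 "$probe_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    # Collect the probe result, then cancel the watchdog.
    wait "$probe_pid" 2>/dev/null
    result=$?
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null
    return "$result"
}
```

A return value of 0 means the path answered in time. Any other value means stat failed or was killed by the watchdog.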
Updated by okurz over 1 year ago
WDYT about a solution based on https://engineerworkshop.com/blog/automatically-resolve-nfs-stale-file-handle-errors-in-ubuntu-linux/amp/
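#!/bin/sh
# Collect every mount point that df reports as stale: take the second field
# of each error line and strip the colons from it.
list=$(df 2>&1 | grep 'Stale file handle' | awk '{print $2}' | tr -d \:)
for directory in $list
do
    # Lazily unmount the stale mount, then remount everything from /etc/fstab.
    umount -l "$directory"
    mount -a
done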
So far I don't see any flaws in that, and it could be the easiest solution. One could of course replace grep+awk+tr with sed, but that shouldn't matter.
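For illustration, the sed variant mentioned above could look roughly like this. It is a sketch, not the deployed script: the function names are hypothetical, and it additionally assumes that df is run with LC_ALL=C, so that any quoting around the path uses plain ASCII single quotes, which the expression strips.

```shell
#!/bin/sh
# Sketch only: hypothetical sed-based replacement for grep+awk+tr.

# Read df output (stdout and stderr merged) on stdin and print the path of
# every mount that df reported as stale, without surrounding quotes.
stale_mounts_from_df() {
    sed -n "s/^df: '\{0,1\}\(.*[^']\)'\{0,1\}: Stale file handle\$/\1/p"
}

# Lazily unmount each stale mount, then remount from /etc/fstab once.
# Does nothing when df reports no stale mounts.
recover_stale_mounts() {
    stale=$(LC_ALL=C df 2>&1 | stale_mounts_from_df)
    [ -n "$stale" ] || return 0
    printf '%s\n' "$stale" | while IFS= read -r directory; do
        umount -l "$directory"
    done
    mount -a
}
```

Reading line by line instead of word-splitting a variable also keeps mount points containing spaces intact, and mount -a runs only once after all stale mounts have been detached.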
Updated by nicksinger over 1 year ago
- Status changed from Workable to Feedback
Script was deployed with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/874 and later fixed with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/875. I checked all workers for the file and timer being present:
openqa:~ # salt -C 'G@roles:worker' cmd.run 'ls -lah /usr/local/bin/recover-nfs.sh'
worker3.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker2.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker6.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker18.qa.suse.cz:
-rwxr--r-- 1 root root 155 Jun 6 00:46 /usr/local/bin/recover-nfs.sh
openqaworker16.qa.suse.cz:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker5.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker8.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker17.qa.suse.cz:
-rwxr--r-- 1 root root 155 Jun 6 00:46 /usr/local/bin/recover-nfs.sh
worker9.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker1.qe.nue2.suse.org:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker14.qa.suse.cz:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
powerqaworker-qam-1.qa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
QA-Power8-5-kvm.qa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker11.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker13.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
QA-Power8-4-kvm.qa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
malbec.arch.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker10.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
worker12.oqa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
grenache-1.qa.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:47 /usr/local/bin/recover-nfs.sh
openqaworker-arm-1.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:49 /usr/local/bin/recover-nfs.sh
openqaworker-arm-2.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:49 /usr/local/bin/recover-nfs.sh
openqaworker-arm-3.suse.de:
-rwxr--r-- 1 root root 155 Jun 6 00:49 /usr/local/bin/recover-nfs.sh
openqa:~ # salt -C 'G@roles:worker' cmd.run 'systemctl status recover-nfs'
openqaworker16.qa.suse.cz:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker18.qa.suse.cz:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker9.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker17.qa.suse.cz:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker8.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker14.qa.suse.cz:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
malbec.arch.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker2.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker3.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker12.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker11.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
grenache-1.qa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker13.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker10.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker6.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
QA-Power8-5-kvm.qa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
worker5.oqa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker1.qe.nue2.suse.org:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
QA-Power8-4-kvm.qa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
powerqaworker-qam-1.qa.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker-arm-2.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker-arm-1.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
openqaworker-arm-3.suse.de:
* recover-nfs.service - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.service; static)
Active: inactive (dead)
ERROR: Minions returned with non-zero exit code
openqa:~ # salt -C 'G@roles:worker' cmd.run 'systemctl status recover-nfs.timer'
openqaworker16.qa.suse.cz:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker18.qa.suse.cz:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker17.qa.suse.cz:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker2.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker3.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker8.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker9.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker6.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker1.qe.nue2.suse.org:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
powerqaworker-qam-1.qa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker14.qa.suse.cz:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
QA-Power8-5-kvm.qa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker13.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
QA-Power8-4-kvm.qa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker11.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker5.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker12.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
worker10.oqa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
malbec.arch.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
grenache-1.qa.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker-arm-1.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker-arm-2.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
openqaworker-arm-3.suse.de:
* recover-nfs.timer - Automatically recover stall nfs shares.
Loaded: loaded (/etc/systemd/system/recover-nfs.timer; disabled; vendor preset: disabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * recover-nfs.service
ERROR: Minions returned with non-zero exit code
To enable the timer I created: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/876
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
Fixed another typo in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/877 and added a suggestion from @mkittler to depend on the required mountpoint where the script is stored. I verified with salt -C 'G@roles:worker' cmd.run 'systemctl status recover-nfs.timer' that the timer is active, and we don't see "Failed systemd services" alerts, so I think we can conclude for now.
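To make such a check across all workers easier to read than the full status output, the salt output could be condensed to just the hosts that need attention. This is a sketch: the helper name inactive_minions is hypothetical, and it assumes salt's default text output, where each minion name is followed by a colon and the command output is indented below it.

```shell
#!/bin/sh
# Sketch only: read salt cmd.run output for 'systemctl is-active <unit>'
# on stdin and print every minion whose unit is not reported as "active".
inactive_minions() {
    awk '
        # A non-indented line ending in ":" names the minion.
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # Skip the summary line salt prints when a minion returned non-zero.
        /^ERROR:/ { next }
        # Any other non-empty line is the state reported for the last minion.
        NF && $1 != "active" { print host }
    '
}
```

Usage would be along the lines of: salt -C 'G@roles:worker' cmd.run 'systemctl is-active recover-nfs.timer' | inactive_minions. An empty result means the timer is active everywhere.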