action #109046
closed[tools] auto_review:"Unable to find image SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2.*svirt":retry
Description
Observation
openQA test in scenario sle-15-SP4-Migration-from-SLE12-SPx-s390x-offline_sles12sp3_ltss_pscc_asmm-lgm_all_full@s390x-kvm-sle15 fails in bootloader_zkvm
But the qcow exists:
$ /var/lib/openqa/share/factory/hdd/fixed> ll SLES-12-SP3-s390x-GM-gnome-allpatterns.qcow2
-rw-r--r-- 1 geekotest nogroup 14485946368 Jan 14 08:45 SLES-12-SP3-s390x-GM-gnome-allpatterns.qcow2
And the asset is actually there.
It happens in all jobs with the svirt backend that have their HDD in the fixed directory, i.e. all s390x jobs and some others that use that backend too.
Examples:
https://openqa.suse.de/tests/8414273
https://openqa.suse.de/tests/8414268
https://openqa.suse.de/tests/8414272
https://openqa.suse.de/tests/8414267
https://openqa.suse.de/tests/8414271
https://openqa.suse.de/tests/8414266
https://openqa.suse.de/tests/8414270
https://openqa.suse.de/tests/8414265
https://openqa.suse.de/tests/8414269
https://openqa.suse.de/tests/8414264
https://openqa.suse.de/tests/8414323
https://openqa.suse.de/tests/8414322
https://openqa.suse.de/tests/8414321
https://openqa.suse.de/tests/8414320
Test suite description
The base test suite is used for job templates defined in YAML documents. It has no settings of its own.
Reproducible
Fails since (at least) Build 113.1
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
call openqa-query-for-job-label poo#109046
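For reference, a minimal sketch of fetching and running that helper (the host variable mirrors the invocation style used in a later comment and is an assumption here):
# download the helper script from os-autoinst/scripts
curl -sSLO https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
# list jobs labeled with this ticket on openqa.suse.de
host=openqa.suse.de bash ./openqa-query-for-job-label poo#109046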
Expected result
Last good: 108.1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by mloviska over 2 years ago
Seems like an NFS issue; normally a re-mount helps, but this time it has not.
openqaw5-xen:~ # find /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed -name SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
Updated by jlausuch over 2 years ago
mloviska wrote:
Seems like an NFS issue; normally a re-mount helps, but this time it has not.
openqaw5-xen:~ # find /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed -name SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
Yes, I have the same problem mounting the NFS share in my environment, so that must be it.
This has been happening since Sunday. I tried to investigate a bit, but all I found was that the fixed directory was owned by root and had full (777) permissions:
drwxrwxrwx 2 geekotest root 49152 Mar 24 03:05 fixed
So I changed it to geekotest:nogroup with standard permissions (as the parent dir hdd has):
drwxr-xr-x 2 geekotest nogroup 49152 Mar 24 03:05 fixed
However, that didn't solve the problem, so it must be something else.
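For reference, the ownership and permission change described above corresponds to commands like these (a sketch, assuming they are run as root on the machine serving the share):
# give the fixed asset directory the same owner as its contents
chown geekotest:nogroup /var/lib/openqa/share/factory/hdd/fixed
# standard permissions, matching the parent hdd directory
chmod 755 /var/lib/openqa/share/factory/hdd/fixed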
Updated by okurz over 2 years ago
- Subject changed from Unable to find image defined in HDD_1 in fixed directory for svirt backend to [qe-core] Unable to find image defined in HDD_1 in fixed directory for svirt backend
As you stated, the file does exist on osd, and osd serves the directory over NFS, so the problem is likely on the hypervisor host. I assume that if you don't pick this up for qac, then you expect qe-core to look at it.
Updated by jlausuch over 2 years ago
okurz wrote:
As you stated, the file does exist on osd, and osd serves the directory over NFS, so the problem is likely on the hypervisor host. I assume that if you don't pick this up for qac, then you expect qe-core to look at it.
Ok, thanks. Maybe restarting the NFS service would help (see the sketch below)?
It seems that after mounting NFS, the fixed dir shows the issue Martin mentioned.
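For reference, on a systemd-based NFS server that restart would be something like the following (the unit name is an assumption and differs between distributions, e.g. nfsserver on older SUSE releases):
# restart the kernel NFS server on the machine exporting the share
systemctl restart nfs-server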
Updated by okurz over 2 years ago
- Subject changed from [qe-core] Unable to find image defined in HDD_1 in fixed directory for svirt backend to [tools] Unable to find image defined in HDD_1 in fixed directory for svirt backend
- Assignee set to okurz
- Target version set to Ready
https://engineerworkshop.com/blog/automatically-resolve-nfs-stale-file-handle-errors-in-ubuntu-linux/ suggests running a script from a cron job, e.g. every 5 minutes, to check for such messages.
I manually worked around the problem for now with
umount -l /var/lib/openqa/share
mount -a
and I can access files in the share again. Let me think of improvements.
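A client-side periodic check along the lines of the linked blog post could look like this (a sketch, not the script from the post; only the share path is taken from this ticket):
#!/bin/bash
# remount the openQA share if it reports a stale NFS file handle
share=/var/lib/openqa/share
if ls "$share" 2>&1 | grep -q 'Stale file handle'; then
    # lazy-unmount the stale mount, then remount everything from fstab
    umount -l "$share"
    mount -a
fi
This could then be installed via a crontab entry such as */5 * * * * /usr/local/bin/check-openqa-share (script path hypothetical).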
Updated by okurz over 2 years ago
- Copied to action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems added
Updated by okurz over 2 years ago
- Status changed from Workable to Resolved
According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation which can simply happen. The best solution I have seen so far is to check periodically in a cron job on the client, but I don't think the situation appears often enough for that. https://unix.stackexchange.com/a/447581 suggests calling
exportfs -ua && exportfs -a
on the server. I did that just now, but I wonder when this should be done automatically.
https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action is to ensure that openqaw5-xen itself is updated to a more current OS -> created a specific ticket about that for "[qe-core]" to handle in #109085.
Updated by jlausuch over 2 years ago
okurz wrote:
According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation which can simply happen. The best solution I have seen so far is to check periodically in a cron job on the client, but I don't think the situation appears often enough for that. https://unix.stackexchange.com/a/447581 suggests calling
exportfs -ua && exportfs -a
on the server. I did that just now, but I wonder when this should be done automatically. https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action is to ensure that openqaw5-xen itself is updated to a more current OS -> created a specific ticket about that for "[qe-core]" to handle in #109085.
Ok, that makes sense. Thanks for taking care of it.
Btw, on my client side, I can now access the fixed dir.
Updated by okurz over 2 years ago
- Subject changed from [tools] Unable to find image defined in HDD_1 in fixed directory for svirt backend to [tools] auto_review:"Unable to find image SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2.*svirt":retry
- Status changed from Resolved to Feedback
jlausuch wrote:
okurz wrote:
According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation which can simply happen. The best solution I have seen so far is to check periodically in a cron job on the client, but I don't think the situation appears often enough for that. https://unix.stackexchange.com/a/447581 suggests calling
exportfs -ua && exportfs -a
on the server. I did that just now, but I wonder when this should be done automatically. https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action is to ensure that openqaw5-xen itself is updated to a more current OS -> created a specific ticket about that for "[qe-core]" to handle in #109085.
Ok, that makes sense. Thanks for taking care of it.
Btw, on my client side, I can now access the fixed dir.
Good to hear that.
Actually reopening; triggering a manual run of
export host=openqa.suse.de; bash -ex ./openqa-monitor-investigation-candidates | bash -e ./openqa-label-known-issues
Also labeled and retriggered multiple jobs manually.
Updated by okurz over 2 years ago
- Description updated (diff)
- Status changed from Feedback to Resolved
$ openqa-query-for-job-label poo#109046
8429356|2022-03-29 07:53:52|done|failed|msdos||openqaworker2
8429355|2022-03-29 07:53:47|done|failed|minimal+base_yast||openqaworker2
8429350|2022-03-29 07:53:25|done|failed|lvm+RAID1||openqaworker2
8429351|2022-03-29 07:53:24|done|failed|minimal+base_yast||openqaworker2
8421641|2022-03-28 11:06:39|done|failed|msdos||openqaworker2
8421429|2022-03-28 10:41:33|done|failed|minimal+base_yast||openqaworker2
8421428|2022-03-28 10:35:50|done|failed|lvm+RAID1||openqaworker2
8420964|2022-03-28 10:29:16|done|failed|minimal+base_yast||openqaworker2
8419261|2022-03-28 09:32:13|done|failed|jeos-extratest||openqaworker2
8419259|2022-03-28 09:29:38|done|failed|jeos-filesystem||openqaworker2
Looks good now.