action #109046

[tools] auto_review:"Unable to find image SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2.*svirt":retry

Added by jlausuch 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2022-03-28
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP4-Migration-from-SLE12-SPx-s390x-offline_sles12sp3_ltss_pscc_asmm-lgm_all_full@s390x-kvm-sle15 fails in bootloader_zkvm, but the qcow image exists:

$ /var/lib/openqa/share/factory/hdd/fixed> ll SLES-12-SP3-s390x-GM-gnome-allpatterns.qcow2
-rw-r--r-- 1 geekotest nogroup 14485946368 Jan 14 08:45 SLES-12-SP3-s390x-GM-gnome-allpatterns.qcow2

And the asset is actually there:

It happens in all jobs using the svirt backend that have their HDD image in the fixed directory, so all s390x jobs and some others that use that backend are affected.
Examples:
https://openqa.suse.de/tests/8414273
https://openqa.suse.de/tests/8414268
https://openqa.suse.de/tests/8414272
https://openqa.suse.de/tests/8414267
https://openqa.suse.de/tests/8414271
https://openqa.suse.de/tests/8414266
https://openqa.suse.de/tests/8414270
https://openqa.suse.de/tests/8414265
https://openqa.suse.de/tests/8414269
https://openqa.suse.de/tests/8414264

https://openqa.suse.de/tests/8414323
https://openqa.suse.de/tests/8414322
https://openqa.suse.de/tests/8414321
https://openqa.suse.de/tests/8414320

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Fails since (at least) Build 113.1

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#109046

Expected result

Last good: 108.1 (or more recent)

Further details

Always latest result in this scenario: latest

hdd.png (30.7 KB), added by jlausuch, 2022-03-28 06:19

Related issues

Copied to openQA Tests - action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems (New, 2022-03-28)

History

#1 Updated by mloviska 3 months ago

Seems like an NFS issue. Normally a re-mount helps; however, this time it has not.

openqaw5-xen:~ # find /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed -name SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
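The check above can be scripted so that it does not hang on an unresponsive mount. The following is only a sketch: the function names has_stale_message and is_stale_path are hypothetical, and it assumes GNU coreutils (timeout) and the English error text "Stale file handle" shown in the find output.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of openQA.

# Returns 0 when the given error text contains the ESTALE message.
has_stale_message() {
    case $1 in
        *"Stale file handle"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Probes a path and returns 0 only if the probe reports a stale handle.
# timeout guards against a hung NFS mount; LC_ALL=C forces the English
# message. A probe that times out prints nothing and is NOT reported as stale.
is_stale_path() {
    local err
    err=$(LC_ALL=C timeout 10 ls -d -- "$1" 2>&1 >/dev/null) || true
    has_stale_message "$err"
}

# Probe the same two directories as the find call above.
for dir in /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed; do
    if is_stale_path "$dir"; then
        echo "stale: $dir"
    fi
done
```

Probing the subdirectory explicitly matters here because only fixed reported the error in the find output, not its parent.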

#2 Updated by jlausuch 3 months ago

mloviska wrote:

Seems like an NFS issue. Normally a re-mount helps; however, this time it has not.

openqaw5-xen:~ # find /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed -name SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle

Yes, I have the same problem when mounting the NFS share in my environment, so that must be it.

This has been happening since Sunday. I investigated a bit, but only found that the fixed directory had group root and full permissions for everyone:
drwxrwxrwx 2 geekotest root 49152 Mar 24 03:05 fixed

So I changed it to geekotest:nogroup with standard permissions (matching the parent directory hdd):
drwxr-xr-x 2 geekotest nogroup 49152 Mar 24 03:05 fixed

However, that didn't solve the problem, so it must be something else.

#3 Updated by okurz 3 months ago

  • Subject changed from Unable to find image defined in HDD_1 in fixed directory for svirt backend to [qe-core] Unable to find image defined in HDD_1 in fixed directory for svirt backend

As you stated, the file does exist on osd and osd serves the directory over NFS so the problem is likely on the hypervisor host. I assume if you don't pick it up for qac then you expect qe-core to look at that.

#4 Updated by jlausuch 3 months ago

okurz wrote:

As you stated, the file does exist on osd and osd serves the directory over NFS so the problem is likely on the hypervisor host. I assume if you don't pick it up for qac then you expect qe-core to look at that.

Ok, thanks. Maybe restarting the NFS service would help?
It seems that after mounting NFS, the fixed dir shows the issue Martin mentioned.

#5 Updated by okurz 3 months ago

  • Subject changed from [qe-core] Unable to find image defined in HDD_1 in fixed directory for svirt backend to [tools] Unable to find image defined in HDD_1 in fixed directory for svirt backend
  • Assignee set to okurz
  • Target version set to Ready

https://engineerworkshop.com/blog/automatically-resolve-nfs-stale-file-handle-errors-in-ubuntu-linux/ suggests running a script from a cron job, e.g. every 5 minutes, to check for such messages.

I manually worked around the problem for now with

umount -l /var/lib/openqa/share
mount -a

and I can access files in the share again. Let me think of improvements.
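A periodic check along the lines of the cron suggestion, combined with the manual workaround, could look roughly like the sketch below. This is not what is deployed anywhere: SHARE, PROBE, DRY_RUN, remount_share and check_share are hypothetical names, and the script assumes GNU coreutils (timeout) and an fstab entry for the share so that mount -a restores it.

```shell
#!/bin/bash
# Sketch only: variable and function names are hypothetical.
SHARE=${SHARE:-/var/lib/openqa/share}
PROBE=${PROBE:-$SHARE/factory/hdd/fixed}

# Lazy-unmount the share and remount everything from /etc/fstab,
# mirroring the manual workaround. With DRY_RUN set to a non-empty
# value the commands are only printed, not executed.
remount_share() {
    ${DRY_RUN:+echo} umount -l "$1" && ${DRY_RUN:+echo} mount -a
}

# Remount only when probing the directory reports a stale handle.
# A probe that times out or fails for another reason does nothing.
check_share() {
    local err
    err=$(LC_ALL=C timeout 10 ls -d -- "$PROBE" 2>&1 >/dev/null) || true
    case $err in
        *"Stale file handle"*) remount_share "$SHARE" ;;
    esac
}
```

To use it from cron, the script would call check_share at the end and run as root, since umount and mount need root privileges. Running it first with DRY_RUN=1 shows what would be executed.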

#6 Updated by okurz 3 months ago

  • Copied to action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems added

#7 Updated by okurz 3 months ago

  • Status changed from Workable to Resolved

According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation which can just happen. The best solution I saw right now is to check periodically in a cron job on the client. For that I think the situation does not appear often enough. https://unix.stackexchange.com/a/447581 suggests to call exportfs -ua && exportfs -a on the server. I did that right now but I wonder when this should be done automatically.

https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action would be to ensure that openqaw5-xen itself is updated to a more current OS -> Created a specific ticket about that to "[qe-core]" to handle that in #109085

#8 Updated by jlausuch 3 months ago

okurz wrote:

According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation which can just happen. The best solution I saw right now is to check periodically in a cron job on the client. For that I think the situation does not appear often enough. https://unix.stackexchange.com/a/447581 suggests to call exportfs -ua && exportfs -a on the server. I did that right now but I wonder when this should be done automatically.

https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action would be to ensure that openqaw5-xen itself is updated to a more current OS -> Created a specific ticket about that to "[qe-core]" to handle that in #109085

Ok, that makes sense. Thanks for taking care.
Btw, on my client side, I can now access fixed dir.

#9 Updated by okurz 3 months ago

  • Subject changed from [tools] Unable to find image defined in HDD_1 in fixed directory for svirt backend to [tools] auto_review:"Unable to find image SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2.*svirt":retry
  • Status changed from Resolved to Feedback

jlausuch wrote:

okurz wrote:

According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation which can just happen. The best solution I saw right now is to check periodically in a cron job on the client. For that I think the situation does not appear often enough. https://unix.stackexchange.com/a/447581 suggests to call exportfs -ua && exportfs -a on the server. I did that right now but I wonder when this should be done automatically.

https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action would be to ensure that openqaw5-xen itself is updated to a more current OS -> Created a specific ticket about that to "[qe-core]" to handle that in #109085

Ok, that makes sense. Thanks for taking care.
Btw, on my client side, I can now access fixed dir.

Good to hear that.

Actually reopening. I am doing a manual run of

export host=openqa.suse.de; bash -ex ./openqa-monitor-investigation-candidates | bash -e ./openqa-label-known-issues

Also labeled and retriggered multiple jobs manually.

#10 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

$ openqa-query-for-job-label poo#109046
8429356|2022-03-29 07:53:52|done|failed|msdos||openqaworker2
8429355|2022-03-29 07:53:47|done|failed|minimal+base_yast||openqaworker2
8429350|2022-03-29 07:53:25|done|failed|lvm+RAID1||openqaworker2
8429351|2022-03-29 07:53:24|done|failed|minimal+base_yast||openqaworker2
8421641|2022-03-28 11:06:39|done|failed|msdos||openqaworker2
8421429|2022-03-28 10:41:33|done|failed|minimal+base_yast||openqaworker2
8421428|2022-03-28 10:35:50|done|failed|lvm+RAID1||openqaworker2
8420964|2022-03-28 10:29:16|done|failed|minimal+base_yast||openqaworker2
8419261|2022-03-28 09:32:13|done|failed|jeos-extratest||openqaworker2
8419259|2022-03-28 09:29:38|done|failed|jeos-filesystem||openqaworker2

looks good now
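
To judge whether new failures are still being labelled, the listing can be summarized per day. A minimal sketch, assuming the pipe-separated layout shown above with the timestamp in the second field; jobs_per_day is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch only: counts labelled jobs per calendar day.
# Reads pipe-separated lines on stdin; the date is taken from the
# part of the second field before the first space.
jobs_per_day() {
    awk -F'|' '{ split($2, ts, " "); count[ts[1]]++ }
               END { for (day in count) print day, count[day] }' | sort
}
```

It would be fed from the query, for example: openqa-query-for-job-label poo#109046 | jobs_per_day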
