action #109046

closed

[tools] auto_review:"Unable to find image SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2.*svirt":retry

Added by jlausuch about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2022-03-28
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP4-Migration-from-SLE12-SPx-s390x-offline_sles12sp3_ltss_pscc_asmm-lgm_all_full@s390x-kvm-sle15 fails in
bootloader_zkvm
But the qcow exists:

$ /var/lib/openqa/share/factory/hdd/fixed> ll SLES-12-SP3-s390x-GM-gnome-allpatterns.qcow2
-rw-r--r-- 1 geekotest nogroup 14485946368 Jan 14 08:45 SLES-12-SP3-s390x-GM-gnome-allpatterns.qcow2

And the asset is actually there (see the attached hdd.png):

It happens in all jobs using the svirt backend that have their HDD in the fixed directory, i.e. all s390x jobs and some others that use that backend too.
Examples:
https://openqa.suse.de/tests/8414273
https://openqa.suse.de/tests/8414268
https://openqa.suse.de/tests/8414272
https://openqa.suse.de/tests/8414267
https://openqa.suse.de/tests/8414271
https://openqa.suse.de/tests/8414266
https://openqa.suse.de/tests/8414270
https://openqa.suse.de/tests/8414265
https://openqa.suse.de/tests/8414269
https://openqa.suse.de/tests/8414264

https://openqa.suse.de/tests/8414323
https://openqa.suse.de/tests/8414322
https://openqa.suse.de/tests/8414321
https://openqa.suse.de/tests/8414320

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Fails since (at least) Build 113.1

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#109046

Expected result

Last good: 108.1 (or more recent)

Further details

Always latest result in this scenario: latest


Files

hdd.png (30.7 KB) hdd.png jlausuch, 2022-03-28 06:19

Related issues 1 (1 open, 0 closed)

Copied to openQA Tests - action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems (New, 2022-03-28)

Actions #1

Updated by mloviska about 2 years ago

Seems like an NFS issue. Normally a re-mount helps; however, this time it has not.

openqaw5-xen:~ # find /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed -name SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
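
The find output above can also be checked for in a script. The following is a minimal sketch, assuming GNU stat; the function name is_stale is made up for illustration. It only tells a stale handle apart from other errors by matching the error text, and on a hard-mounted share the stat call itself may block.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if accessing PATH fails with a
# "Stale file handle" error (older glibc says "Stale NFS file handle").
is_stale() {
    # LC_ALL=C forces the untranslated error message;
    # stderr is captured, stdout is discarded.
    err=$(LC_ALL=C stat -- "$1" 2>&1 >/dev/null) && return 1
    case $err in
        *Stale*"file handle"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Example use on the hypervisor host: is_stale /var/lib/openqa/share/factory/hdd/fixed && echo "stale handle"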
Actions #2

Updated by jlausuch about 2 years ago

mloviska wrote:

Seems like an NFS issue. Normally a re-mount helps; however, this time it has not.

openqaw5-xen:~ # find /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed -name SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle
find: ‘/var/lib/openqa/share/factory/hdd/fixed’: Stale file handle

Yes, I also have the same problem mounting the NFS share in my environment. Must be that.

This has been happening since Sunday. I tried to investigate a bit, but I only found out that the fixed directory had the group root and full permissions for everyone:
drwxrwxrwx 2 geekotest root 49152 Mar 24 03:05 fixed

So I changed it to geekotest:nogroup with standard permissions (matching the parent directory hdd):
drwxr-xr-x 2 geekotest nogroup 49152 Mar 24 03:05 fixed

However, that didn't solve the problem, so it must be something else.
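
For reference, comparing a directory against its parent as described above can be scripted. This is only a sketch, assuming GNU stat (the -c option); the function name same_perms is hypothetical, and the exact commands used for the change are not recorded in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: succeed if two paths have the same owner, group
# and octal mode, e.g. to compare "fixed" with its parent "hdd".
same_perms() {
    # %U = owner name, %G = group name, %a = octal permissions
    [ "$(stat -c '%U:%G %a' -- "$1")" = "$(stat -c '%U:%G %a' -- "$2")" ]
}
```

Example: same_perms /var/lib/openqa/share/factory/hdd /var/lib/openqa/share/factory/hdd/fixed || echo "permissions differ". The listing drwxr-xr-x with geekotest nogroup corresponds to mode 755 and geekotest:nogroup.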

Actions #3

Updated by okurz about 2 years ago

  • Subject changed from Unable to find image defined in HDD_1 in fixed directory for svirt backend to [qe-core] Unable to find image defined in HDD_1 in fixed directory for svirt backend

As you stated, the file does exist on osd and osd serves the directory over NFS so the problem is likely on the hypervisor host. I assume if you don't pick it up for qac then you expect qe-core to look at that.

Actions #4

Updated by jlausuch about 2 years ago

okurz wrote:

As you stated, the file does exist on osd and osd serves the directory over NFS so the problem is likely on the hypervisor host. I assume if you don't pick it up for qac then you expect qe-core to look at that.

Ok, thanks. Maybe restarting the NFS service would help?
It seems that after mounting the NFS share, the fixed dir shows the issue Martin mentioned.

Actions #5

Updated by okurz about 2 years ago

  • Subject changed from [qe-core] Unable to find image defined in HDD_1 in fixed directory for svirt backend to [tools] Unable to find image defined in HDD_1 in fixed directory for svirt backend
  • Assignee set to okurz
  • Target version set to Ready

https://engineerworkshop.com/blog/automatically-resolve-nfs-stale-file-handle-errors-in-ubuntu-linux/ suggests running a script in a cron job, e.g. every 5 minutes, to check for such messages.

I manually worked around the problem for now with

umount -l /var/lib/openqa/share
mount -a

and I can access files in the share again. Let me think of improvements.
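
One possible shape for such a periodic check, combining the cron idea with the manual workaround above, is sketched below. This is an assumption-laden illustration and not something deployed on any host: the function name, the DRY_RUN switch and the --run argument are made up, and timeout from GNU coreutils is assumed to be available. It treats any failure to stat the probe directory within 10 seconds as a reason to remount, which is broader than only a stale handle.

```shell
#!/bin/sh
# Hypothetical cron helper: remount SHARE if PROBE cannot be accessed.
# Usage: remount_if_unreachable SHARE PROBE
remount_if_unreachable() {
    share=$1 probe=$2
    # timeout guards against stat blocking on an unresponsive mount.
    if timeout 10 stat -- "$probe" >/dev/null 2>&1; then
        echo "ok: $probe is accessible"
        return 0
    fi
    echo "remounting $share because $probe is not accessible"
    # With DRY_RUN set, the commands are only printed, not executed.
    ${DRY_RUN:+echo} umount -l "$share" && ${DRY_RUN:+echo} mount -a
}

# Only act when called with --run, e.g. from a crontab entry.
if [ "${1:-}" = "--run" ]; then
    remount_if_unreachable /var/lib/openqa/share \
        /var/lib/openqa/share/factory/hdd/fixed
fi
```

Running it with DRY_RUN=1 first shows which commands would be executed without touching the mount.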

Actions #6

Updated by okurz about 2 years ago

  • Copied to action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems added
Actions #7

Updated by okurz about 2 years ago

  • Status changed from Workable to Resolved

According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation that can simply happen. The best solution I have seen so far is to check periodically in a cron job on the client, but I think the situation does not occur often enough to justify that. https://unix.stackexchange.com/a/447581 suggests calling exportfs -ua && exportfs -a on the server. I did that just now, but I wonder when this should be done automatically.

https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action would be to ensure that openqaw5-xen itself is updated to a more current OS. I created a specific "[qe-core]" ticket to handle that: #109085

Actions #8

Updated by jlausuch about 2 years ago

okurz wrote:

According to my research, e.g. https://unix.stackexchange.com/questions/433051/mount-nfs-stale-file-handle-error-cannot-umount , this is a situation that can simply happen. The best solution I have seen so far is to check periodically in a cron job on the client, but I think the situation does not occur often enough to justify that. https://unix.stackexchange.com/a/447581 suggests calling exportfs -ua && exportfs -a on the server. I did that just now, but I wonder when this should be done automatically.

https://unix.stackexchange.com/a/433071 suggests that the problem might be due to an outdated NFS4 client. So maybe the best course of action would be to ensure that openqaw5-xen itself is updated to a more current OS. I created a specific "[qe-core]" ticket to handle that: #109085

Ok, that makes sense. Thanks for taking care.
Btw, on my client side, I can now access fixed dir.

Actions #9

Updated by okurz about 2 years ago

  • Subject changed from [tools] Unable to find image defined in HDD_1 in fixed directory for svirt backend to [tools] auto_review:"Unable to find image SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2.*svirt":retry
  • Status changed from Resolved to Feedback

Good to hear that.

Actually reopening to do a manual run of

export host=openqa.suse.de; bash -ex ./openqa-monitor-investigation-candidates | bash -e ./openqa-label-known-issues

I also labeled and retriggered multiple jobs manually.

Actions #10

Updated by okurz about 2 years ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

$ openqa-query-for-job-label poo#109046
8429356|2022-03-29 07:53:52|done|failed|msdos||openqaworker2
8429355|2022-03-29 07:53:47|done|failed|minimal+base_yast||openqaworker2
8429350|2022-03-29 07:53:25|done|failed|lvm+RAID1||openqaworker2
8429351|2022-03-29 07:53:24|done|failed|minimal+base_yast||openqaworker2
8421641|2022-03-28 11:06:39|done|failed|msdos||openqaworker2
8421429|2022-03-28 10:41:33|done|failed|minimal+base_yast||openqaworker2
8421428|2022-03-28 10:35:50|done|failed|lvm+RAID1||openqaworker2
8420964|2022-03-28 10:29:16|done|failed|minimal+base_yast||openqaworker2
8419261|2022-03-28 09:32:13|done|failed|jeos-extratest||openqaworker2
8419259|2022-03-28 09:29:38|done|failed|jeos-filesystem||openqaworker2

looks good now
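
To get a quick overview of such a listing, the pipe-separated lines can be grouped with awk. This is a small sketch that assumes the fifth field is the test name, as it appears to be in the output above; the function name summarize_by_test is hypothetical and not part of the os-autoinst scripts.

```shell
#!/bin/sh
# Hypothetical helper: count jobs per test name from
# "id|timestamp|state|result|test||worker" lines read on stdin.
summarize_by_test() {
    awk -F'|' 'NF >= 5 { count[$5]++ }
               END { for (t in count) print count[t], t }' |
        sort -k1,1nr -k2
}
```

Example: openqa-query-for-job-label poo#109046 | summarize_by_test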
