action #121573
Asset/HDD goes missing while job is running
Status: open, 0% done
Description
Observation¶
mdoucha pointed out an interesting case ( https://suse.slack.com/archives/C02CANHLANP/p1670328701486539 ) where an HDD file was deleted while a job was running: https://openqa.suse.de/tests/10089958#step/oom03/5
I found a similar problem described here: #64544, but that one is mainly about "pending" jobs. However, the related change in https://github.com/os-autoinst/openQA/pull/2918/files states that we "Consider all jobs which are not done or cancelled as pending".
With journalctl --since=today -u openqa-worker-cacheservice-minion.service | grep -C 100 "14-819.1.g3e6aee2-Server"
I tried to understand when the asset was deleted (and why; most likely because the cache was full), but I only found when openQA/cacheservice downloaded it:
Dec 06 07:58:11 powerqaworker-qam-1 openqa-worker-cacheservice-minion[101810]: [101810] [i] Downloading "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2" from "http://openqa.suse.de/tests/10089934/asset/hdd/sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"
Afterwards I just see Downloading: "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"
without the http link, so I assume this means the asset is still present in the cache (and gets reused).
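To narrow down when (and whether) the asset was removed, the cache can also be inspected directly on the worker. A minimal sketch, assuming the default cache location /var/lib/openqa/cache with a per-webUI-host subdirectory and the cache service's SQLite database cache.sqlite (paths, table and column names are assumptions and may differ):
# list cached assets, newest first
ls -lt /var/lib/openqa/cache/openqa.suse.de/ | head
# check whether the cache service still tracks the asset (schema is an assumption)
sqlite3 /var/lib/openqa/cache/cache.sqlite "SELECT * FROM assets WHERE filename LIKE '%g3e6aee2-Server%';"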
Acceptance criteria¶
- AC1: required assets are not deleted while they are in use by running/pending openQA tests
Suggestions¶
- Take a look at our cacheservice-minion code to see what conditions are required for an asset to be deleted (see the sketch below)
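Not part of the original report, but as a starting point for the suggestion above, the deletion conditions can be located in an openQA checkout roughly like this (the directory and search terms are assumptions based on the current repository layout):
git clone https://github.com/os-autoinst/openQA.git
cd openQA
# the asset cache and its cleanup logic are expected to live under lib/OpenQA/CacheService/
grep -rniE 'unlink|purge|limit' lib/OpenQA/CacheService/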
Updated by nicksinger about 2 years ago
- Related to action #121579: Logs of openqa-worker-cacheservice-minion are incomplete and inconsistent added
Updated by okurz about 2 years ago
- Category set to Regressions/Crashes
- Target version set to future
Updated by okurz about 2 years ago
- Related to action #64544: Asset required by scheduled job wiped by limit_assets added
Updated by MDoucha about 2 years ago
Here's a new example where a disk image was downloaded by the job and deleted again within 4 minutes:
https://openqa.suse.de/tests/10174042#step/bootloader_zkvm/18
[2022-12-18T06:18:08.455965+01:00] [info] [pid:25102] Downloading SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2, request #261 sent to Cache Service
[2022-12-18T06:19:39.590090+01:00] [info] [pid:25102] Download of SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2 processed:
...
[2022-12-18T06:22:43.854431+01:00] [info] ::: basetest::runtest: # Test died: Unable to find image SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2 in /var/lib/openqa/share/factory/hdd at sle/lib/bootloader_setup.pm line 1098.
Updated by MDoucha almost 2 years ago
We had 6 new examples of this issue over the weekend, all on s390x:
https://openqa.suse.de/tests/10360482
https://openqa.suse.de/tests/10360528
https://openqa.suse.de/tests/10359966
https://openqa.suse.de/tests/10359968
https://openqa.suse.de/tests/10363049
https://openqa.suse.de/tests/10363051
Updated by MDoucha over 1 year ago
We got more examples today, again on s390x:
https://openqa.suse.de/tests/10894270
https://openqa.suse.de/tests/10894414
https://openqa.suse.de/tests/10891227
Updated by MDoucha over 1 year ago
Looking at the logs in more detail, all of the last three examples show the same error:
find: /var/lib/openqa/share/factory/hdd/fixed: Stale file handle
The error comes from this function call: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/9a7b43ea9c95bbbc83bde0837681b453c0e90dad/tests/installation/bootloader_svirt.pm#L138
It looks like bootloader_svirt ignores the asset files already downloaded by the cache service and downloads them all again via NFS, which can result in the "unable to find image" error above if the NFS connection fails due to transient network issues.
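A quick way to cross-check this on an affected worker, assuming the factory share is mounted at the usual /var/lib/openqa/share and the worker-local cache lives under /var/lib/openqa/cache (both paths are assumptions here, taken from the error messages and the default setup):
# does the image exist on the NFS share the test is looking at?
ls -l /var/lib/openqa/share/factory/hdd/SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2
ls -l /var/lib/openqa/share/factory/hdd/fixed/
# compare with the copy the cache service already downloaded (per-host subdirectory is an assumption)
ls -l /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2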
Updated by pdostal over 1 year ago
- Related to action #127550: test fails in bootloader_zkvm added
Updated by ph03nix over 1 year ago
I can reproduce the "Stale file handle" issue on my laptop:
racetrack-7290:/mnt # mount -t nfs openqa.suse.de:/var/lib/openqa/share/factory willie
racetrack-7290:/mnt # cd willie/
racetrack-7290:/mnt/willie # ls
build-to-fixed.sh hdd iso other repo tmp
racetrack-7290:/mnt/willie # cd hdd/fixed/
-bash: cd: hdd/fixed/: Stale file handle
Looks like the OSD NFS server could use a restart, or at least an unexport and re-export of the share in question.
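For reference, both variants expressed as commands on the NFS server; the export path follows the mount command above, the client specification depends on /etc/exports, and whether a full restart is needed is a judgment call:
# re-export all shares without a full NFS server restart
exportfs -ra
# or unexport and re-export just the factory share
exportfs -u '*:/var/lib/openqa/share/factory'
exportfs -a
# or restart the NFS server entirely (what was done in the end, see the next comment)
systemctl restart nfs-server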
Updated by ph03nix over 1 year ago
I just restarted the NFS server on OSD and will now restart the affected jobs.
Updated by tinita over 1 year ago
- Copied to action #127754: osd nfs-server needed to be restarted but we got no alerts size:M added
Updated by okurz about 1 year ago
- Related to action #152545: Files have been deleted from the cache while the job was running size:M added