action #121573
open
Asset/HDD goes missing while job is running
Added by nicksinger about 2 years ago.
Updated over 1 year ago.
Category:
Regressions/Crashes
Description
Observation
mdoucha pointed out an interesting case ( https://suse.slack.com/archives/C02CANHLANP/p1670328701486539 ) where an HDD file was deleted while a job was running: https://openqa.suse.de/tests/10089958#step/oom03/5
I found a similar problem described in #64544, but that one is mainly about "pending" jobs. However, the related change in https://github.com/os-autoinst/openQA/pull/2918/files states that we "Consider all jobs which are not done or cancelled as pending"
With journalctl --since=today -u openqa-worker-cacheservice-minion.service | grep -C 100 "14-819.1.g3e6aee2-Server"
I tried to understand when the asset was deleted (and why; most likely because the cache was full) but I only found when openQA/cacheservice downloaded it:
Dec 06 07:58:11 powerqaworker-qam-1 openqa-worker-cacheservice-minion[101810]: [101810] [i] Downloading "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2" from "http://openqa.suse.de/tests/10089934/asset/hdd/sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"
Afterwards I just see Downloading: "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"
without the HTTP link, so I assume the asset is still present in the cache (and gets reused).
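To narrow down when and why the cleanup happened, it can also help to inspect the cache directly on the worker. A minimal sketch, assuming the default cache location /var/lib/openqa/cache and that the cache service keeps its metadata in cache.sqlite there; the assets table and its columns are assumptions, so list the schema first before relying on them:

# Cache usage vs. the configured limit (CACHELIMIT in /etc/openqa/workers.ini)
du -sh /var/lib/openqa/cache
grep -i cache /etc/openqa/workers.ini
# Inspect the cache metadata; verify table/column names first, they are assumed here
sqlite3 /var/lib/openqa/cache/cache.sqlite '.tables'
sqlite3 /var/lib/openqa/cache/cache.sqlite "SELECT filename, size, last_use FROM assets ORDER BY last_use;"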
Acceptance criteria
- AC1: required assets are not deleted while they are in use by running/pending openQA tests
Suggestions
- Take a look at our cacheservice-minion code to understand under which conditions an asset gets deleted (see the sketch below)
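One hedged starting point for that: grep the installed cache service code for likely cleanup keywords. A sketch, assuming the worker packages install the Perl modules under /usr/share/openqa/lib (use the git checkout path otherwise); the keyword pattern is a guess, not a reference to known function names:

# Search the cache service code for purge/limit handling
grep -rni 'purge\|limit\|unlink\|delete' /usr/share/openqa/lib/OpenQA/CacheService/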
- Related to action #121579: Logs of openqa-worker-cacheservice-minion are incomplete and inconsistent added
- Category set to Regressions/Crashes
- Target version set to future
- Related to action #64544: Asset required by scheduled job wiped by limit_assets added
- Description updated (diff)
Here's a new example where a disk image was downloaded by the job and deleted again within 4 minutes:
https://openqa.suse.de/tests/10174042#step/bootloader_zkvm/18
[2022-12-18T06:18:08.455965+01:00] [info] [pid:25102] Downloading SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2, request #261 sent to Cache Service
[2022-12-18T06:19:39.590090+01:00] [info] [pid:25102] Download of SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2 processed:
...
[2022-12-18T06:22:43.854431+01:00] [info] ::: basetest::runtest: # Test died: Unable to find image SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2 in /var/lib/openqa/share/factory/hdd at sle/lib/bootloader_setup.pm line 1098.
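To correlate such a case with the cache service's own view, the minion journal can be narrowed to the few minutes between download and failure. A sketch along the lines of the journalctl call above, assuming the journal on that worker still covers the window:

journalctl -u openqa-worker-cacheservice-minion.service --since "2022-12-18 06:18" --until "2022-12-18 06:23" | grep -i 'SLES-15-SP3-s390x-minimal_installed_for_LTP'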
I can reproduce the "Stale file handle" issue on my laptop:
racetrack-7290:/mnt # mount -t nfs openqa.suse.de:/var/lib/openqa/share/factory willie
racetrack-7290:/mnt # cd willie/
racetrack-7290:/mnt/willie # ls
build-to-fixed.sh hdd iso other repo tmp
racetrack-7290:/mnt/willie # cd hdd/fixed/
-bash: cd: hdd/fixed/: Stale file handle
Looks like the OSD NFS server could use a restart, or at least an unexport and re-export of the share in question.
I just restarted the NFS server on OSD and will try to restart the affected jobs now.
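For reference, the less disruptive variant would look roughly like this on the NFS server. A sketch, assuming the share is exported to all clients ('*') and standard exportfs/systemd tooling:

# Unexport and re-export just the affected share
exportfs -u '*:/var/lib/openqa/share/factory'
exportfs -r
# Or the heavier hammer that was used here:
systemctl restart nfs-server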
- Copied to action #127754: osd nfs-server needed to be restarted but we got no alerts size:M added
- Related to action #152545: Files have been deleted from the cache while the job was running size:M added