action #121573

open

Asset/HDD goes missing while job is running

Added by nicksinger about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version: QA (public, currently private due to #173521) - future
Start date: 2022-12-06
Due date:
% Done: 0%
Estimated time:
Description

Observation

mdoucha pointed out an interesting case ( https://suse.slack.com/archives/C02CANHLANP/p1670328701486539 ) where an HDD file was deleted while a job was running: https://openqa.suse.de/tests/10089958#step/oom03/5
I found a similar problem described in #64544, but that ticket is mainly about "pending" jobs. However, the related change in https://github.com/os-autoinst/openQA/pull/2918/files states that we "Consider all jobs which are not done or cancelled as pending", which would include running jobs.

With journalctl --since=today -u openqa-worker-cacheservice-minion.service | grep -C 100 "14-819.1.g3e6aee2-Server" I tried to understand when the asset was deleted (and why; most likely because the cache was full), but I only found the point at which openQA/cacheservice downloaded it:

Dec 06 07:58:11 powerqaworker-qam-1 openqa-worker-cacheservice-minion[101810]: [101810] [i] Downloading "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2" from "http://openqa.suse.de/tests/10089934/asset/hdd/sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"

Afterwards I only see Downloading: "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2" without the HTTP link, so I assume this means the asset was still present in the cache (and got reused).
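
The distinction between log lines that carry a source URL and those that do not can be checked mechanically. The following is a minimal sketch, assuming the message format matches the journal excerpt above; the function name and the "fetched"/"no-url" labels are hypothetical, and reading a URL-less line as a cache hit is the assumption stated in this description, not confirmed cache service behaviour.

```python
import re

# Matches the message part of the journal line quoted above. The optional
# colon covers the URL-less variant (Downloading: "<asset>").
DOWNLOAD_RE = re.compile(
    r'Downloading:? "(?P<asset>[^"]+)"(?: from "(?P<url>[^"]+)")?'
)


def classify_download_lines(lines, asset_fragment):
    """Return (label, line) pairs for download messages mentioning the asset.

    label is "fetched" when the message names a source URL and "no-url"
    otherwise (hypothetical labels, chosen for this sketch only).
    """
    results = []
    for line in lines:
        match = DOWNLOAD_RE.search(line)
        if match is None or asset_fragment not in match.group("asset"):
            continue
        label = "fetched" if match.group("url") else "no-url"
        results.append((label, line))
    return results
```

Fed with the journalctl output from the command above and the fragment "14-819.1.g3e6aee2-Server", this would list every download message for the asset in order, which makes it easier to spot a second "fetched" entry that would hint at an intermediate deletion.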

Acceptance criteria

  • AC1: required assets are not deleted while they are in use by running/pending openQA tests

Suggestions

  • Check in our cacheservice-minion code which conditions must be met for an asset to be deleted
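
One way to picture AC1 is a cache cleanup that never considers assets referenced by unfinished jobs. The sketch below is purely illustrative and is not the cacheservice-minion implementation: the CachedAsset type, the in_use set, and the oldest-first eviction order are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CachedAsset:
    name: str
    size: int          # size in bytes
    last_used: float   # timestamp of the last use


def select_evictions(assets, in_use, bytes_needed):
    """Pick least-recently-used assets to delete, skipping any that are in use.

    Returns (to_delete, freed_bytes). freed_bytes can be smaller than
    bytes_needed when all remaining assets are in use.
    """
    to_delete = []
    freed = 0
    for asset in sorted(assets, key=lambda a: a.last_used):
        if freed >= bytes_needed:
            break
        if asset.name in in_use:
            # AC1: never delete assets needed by running/pending jobs.
            continue
        to_delete.append(asset)
        freed += asset.size
    return to_delete, freed
```

In this sketch the caller has to cope with the case where too little space can be freed, for example by delaying the new download instead of deleting an asset that is still in use.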

Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #121579: Logs of openqa-worker-cacheservice-minion are incomplete and inconsistent (New, 2022-12-06)
Related to openQA Project (public) - action #64544: Asset required by scheduled job wiped by limit_assets (Rejected, okurz, 2020-03-17)
Related to openQA Tests (public) - action #127550: test fails in bootloader_zkvm (Resolved, okurz, 2023-04-12)
Related to openQA Project (public) - action #152545: Files have been deleted from the cache while the job was running size:M (Resolved, mkittler, 2023-12-13, 2023-12-28)
Copied to openQA Infrastructure (public) - action #127754: osd nfs-server needed to be restarted but we got no alerts size:M (Resolved, nicksinger)

Actions #1

Updated by nicksinger about 2 years ago

  • Related to action #121579: Logs of openqa-worker-cacheservice-minion are incomplete and inconsistent added
Actions #2

Updated by okurz about 2 years ago

  • Category set to Regressions/Crashes
  • Target version set to future
Actions #3

Updated by okurz about 2 years ago

  • Related to action #64544: Asset required by scheduled job wiped by limit_assets added
Actions #4

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #5

Updated by MDoucha almost 2 years ago

Here's a new example where a disk image was downloaded by the job and deleted again within 4 minutes:
https://openqa.suse.de/tests/10174042#step/bootloader_zkvm/18

[2022-12-18T06:18:08.455965+01:00] [info] [pid:25102] Downloading SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2, request #261 sent to Cache Service
[2022-12-18T06:19:39.590090+01:00] [info] [pid:25102] Download of SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2 processed:
...
[2022-12-18T06:22:43.854431+01:00] [info] ::: basetest::runtest: # Test died: Unable to find image SLES-15-SP3-s390x-minimal_installed_for_LTP.qcow2 in /var/lib/openqa/share/factory/hdd at sle/lib/bootloader_setup.pm line 1098.
Actions #8

Updated by MDoucha over 1 year ago

Looking at the logs in more detail, all of the last three examples show the same error:

find: /var/lib/openqa/share/factory/hdd/fixed: Stale file handle

The error comes from this function call: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/9a7b43ea9c95bbbc83bde0837681b453c0e90dad/tests/installation/bootloader_svirt.pm#L138

It looks like bootloader_svirt ignores all the asset files already downloaded by the cache service and downloads them all again via NFS, which can result in the "unable to find image" error above if the NFS connection fails due to random network issues.

Actions #9

Updated by pdostal over 1 year ago

Actions #11

Updated by ph03nix over 1 year ago

I can reproduce the "Stale file handle" issue on my laptop:

racetrack-7290:/mnt # mount -t nfs openqa.suse.de:/var/lib/openqa/share/factory willie
racetrack-7290:/mnt # cd willie/
racetrack-7290:/mnt/willie # ls
build-to-fixed.sh  hdd  iso  other  repo  tmp
racetrack-7290:/mnt/willie # cd hdd/fixed/
-bash: cd: hdd/fixed/: Stale file handle

Looks like the OSD NFS server could use a restart, or at least the share in question should be unexported and re-exported.
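
A check along the following lines could detect the condition on a worker before a job trips over it. This is a minimal sketch, assuming the operating system reports the problem as errno ESTALE, as in the shell session above; probe_path and its return labels are hypothetical names, and the stat parameter exists only so the check can be exercised without a real NFS mount.

```python
import errno
import os


def probe_path(path, stat=os.stat):
    """Classify a path as "ok", "stale", or "missing"; re-raise other errors."""
    try:
        stat(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            # Same condition as "Stale file handle" in the session above.
            return "stale"
        if exc.errno == errno.ENOENT:
            return "missing"
        raise
    return "ok"
```

Calling probe_path("/var/lib/openqa/share/factory/hdd/fixed") on an affected worker would be expected to return "stale" while the NFS export is in the state shown above.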

Actions #12

Updated by ph03nix over 1 year ago

I just restarted the NFS server on OSD and am now trying to restart the affected jobs.

Actions #13

Updated by tinita over 1 year ago

  • Copied to action #127754: osd nfs-server needed to be restarted but we got no alerts size:M added
Actions #14

Updated by okurz 12 months ago

  • Related to action #152545: Files have been deleted from the cache while the job was running size:M added