action #121573

open

Asset/HDD goes missing while job is running

Added by nicksinger about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version: QA (public, currently private due to #173521) - future
Start date: 2022-12-06
Due date:
% Done: 0%
Estimated time:

Description

Observation

mdoucha pointed out an interesting case ( https://suse.slack.com/archives/C02CANHLANP/p1670328701486539 ) where an HDD file was deleted while a job was running: https://openqa.suse.de/tests/10089958#step/oom03/5
I found a similar problem described in #64544, but that ticket is mainly about "pending" jobs. However, the related change in https://github.com/os-autoinst/openQA/pull/2918/files states that we "Consider all jobs which are not done or cancelled as pending", which would include running jobs.

With journalctl --since=today -u openqa-worker-cacheservice-minion.service | grep -C 100 "14-819.1.g3e6aee2-Server" I tried to understand when the asset was deleted (and why, most likely because the cache was full), but I only found the point at which openQA/cacheservice downloaded it:

Dec 06 07:58:11 powerqaworker-qam-1 openqa-worker-cacheservice-minion[101810]: [101810] [i] Downloading "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2" from "http://openqa.suse.de/tests/10089934/asset/hdd/sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"

Afterwards I only see Downloading: "sle-12-SP5-ppc64le-4.12.14-819.1.g3e6aee2-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2" without the http link, so I assume this means the asset was still present in the cache (and got reused).
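To make the two kinds of log lines easier to tell apart when reading a longer journal excerpt, here is a minimal sketch in Python. It assumes the two formats quoted above (a `Downloading "<asset>" from "<url>"` line for a real download, and a `Downloading: "<asset>"` line without a URL) and the default journalctl timestamp prefix. The function names `classify` and `timeline` are hypothetical helpers for this ticket, not part of openQA.

```python
import re

# A real download, as in the journal line quoted above (asset name plus source URL).
FRESH = re.compile(r'Downloading "(?P<asset>[^"]+)" from "(?P<url>[^"]+)"')
# The later lines without a URL (format as quoted in this ticket; assumed to be a cache hit).
NO_URL = re.compile(r'Downloading: "(?P<asset>[^"]+)"')


def classify(line):
    """Return (kind, asset) for a cache service log line, or None if it matches neither format."""
    match = FRESH.search(line)
    if match:
        return ("fresh_download", match.group("asset"))
    match = NO_URL.search(line)
    if match:
        return ("no_url", match.group("asset"))
    return None


def timeline(lines, needle):
    """Collect (timestamp, kind) for every matching line whose asset name contains `needle`."""
    events = []
    for line in lines:
        result = classify(line)
        if result and needle in result[1]:
            # The first 15 characters of a default journalctl line are "Mon DD HH:MM:SS".
            events.append((line[:15], result[0]))
    return events
```

Feeding the output of the journalctl command above into `timeline(lines, "14-819.1.g3e6aee2-Server")` would list every time the asset was requested and whether a URL was logged. It would not show a deletion, since no deletion message was found in the log.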

Acceptance criteria

  • AC1: required assets are not deleted while they are in use by running/pending openQA tests
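One way to read AC1 is as a constraint on cache cleanup: when the cache exceeds its size limit, the oldest assets are removed first, but anything referenced by a running or pending job is skipped. The following Python sketch illustrates that idea only. It is not the openQA cache service implementation, and the function `select_for_deletion`, its parameters, and the tuple layout are all hypothetical.

```python
def select_for_deletion(assets, in_use, limit):
    """Pick assets to delete until the cache fits into `limit`.

    assets: list of (name, size, last_used) tuples (hypothetical layout)
    in_use: set of asset names needed by running/pending jobs
    limit:  maximum total cache size, in the same unit as `size`
    """
    total = sum(size for _, size, _ in assets)
    to_delete = []
    # Walk from least recently used to most recently used.
    for name, size, _ in sorted(assets, key=lambda asset: asset[2]):
        if total <= limit:
            break
        if name in in_use:
            # Never evict an asset a running or pending job still depends on.
            continue
        to_delete.append(name)
        total -= size
    return to_delete
```

With this approach the cache may stay above its limit if too many assets are in use, which trades disk usage for job stability.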

Suggestions

  • Take a look at our cacheservice-minion code to find out which conditions are required for an asset to be deleted

Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #121579: Logs of openqa-worker-cacheservice-minion are incomplete and inconsistent (New, 2022-12-06)
Related to openQA Project (public) - action #64544: Asset required by scheduled job wiped by limit_assets (Rejected, okurz, 2020-03-17)
Related to openQA Tests (public) - action #127550: test fails in bootloader_zkvm (Resolved, okurz, 2023-04-12)
Related to openQA Project (public) - action #152545: Files have been deleted from the cache while the job was running size:M (Resolved, mkittler, 2023-12-13 to 2023-12-28)
Copied to openQA Infrastructure (public) - action #127754: osd nfs-server needed to be restarted but we got no alerts size:M (Resolved, nicksinger)