action #73288
opencoordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA
auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry
0%
Description
Observation¶
from https://openqa.suse.de/tests/4812998
[2020-10-13T10:52:32.0254 CEST] [info] [pid:36934] Download of sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2 processed:
[info] [#43340] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
[info] [#43340] Downloading "sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" from "http://openqa.suse.de/tests/4812991/asset/hdd/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2"
[info] [#43340] Cache size 50GiB + needed 2GiB exceeds limit of 50GiB, purging least used assets
[info] [#43340] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-Build20201013-1-HA-QAM.qcow2" because we need space for new assets, reclaiming 1.1GiB
[info] [#43340] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-Build56.1@64bit-gnome.qcow2" because we need space for new assets, reclaiming 1.6GiB
[info] [#43340] Size of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" is 2GiB, with ETag ""81ad0000-5b183f589249e""
[info] [#43340] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" successful, new cache size is 49GiB
[2020-10-13T10:52:32.0261 CEST] [debug] [pid:36934] Found ISO, caching SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso
[2020-10-13T10:52:32.0263 CEST] [info] [pid:36934] Downloading SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso, request #43343 sent to Cache Service
[2020-10-13T10:52:37.0314 CEST] [info] [pid:36934] Download of SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso processed:
[info] [#43343] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#43343] Downloading "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" from "http://openqa.suse.de/tests/4812998/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
[info] [#43343] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" has not changed, updating last use
[2020-10-13T10:52:37.0318 CEST] [info] [pid:36934] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #43346 sent to Cache Service
with worker-log.txt:
[2020-10-13T10:52:32.0256 CEST] [debug] [pid:36934] Linked asset "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" to "/var/lib/openqa/pool/16/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2"
[2020-10-13T10:52:37.0270 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:37.0271 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:37.0315 CEST] [debug] [pid:36934] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" to "/var/lib/openqa/pool/16/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
[2020-10-13T10:52:42.0326 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:42.0326 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0377 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:47.0378 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0408 CEST] [error] [pid:36934] Unable to setup job 4812998: Cache service status error from API: Minion job #43346 failed: Job terminated unexpectedly (exit code: 2, signal: 0)
[2020-10-13T10:52:47.0408 CEST] [debug] [pid:36934] Stopping job 4812998 from openqa.suse.de: 04812998-sle-15-SP1-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build4.12.14-1063.1.ga210de1-ltp_sched@64bit - reason: setup failure
[2020-10-13T10:52:47.0408 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0467 CEST] [info] [pid:26893] Uploading autoinst-log.txt
[2020-10-13T10:52:47.0502 CEST] [info] [pid:26893] Uploading worker-log.txt
Acceptance criteria¶
- AC1: Either job is automatically restarted or the user receives a clear message what the user did wrong
Suggestions¶
- Improve log messages
- Maybe the worker service restarted but this did not retrigger the job. Better restart openQA jobs automatically when this happens
Workaround¶
Retrigger job
Updated by okurz about 4 years ago
- Copied from action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry added
Updated by kraih about 4 years ago
This one is not so harmless. The error message Job terminated unexpectedly (exit code: 2, signal: 0)
comes directly from the Minion job queue worker, and means that the process spawned for the download has died unexpectedly. This is very unusual, there is nothing that would make a Minion job process exit with the status code 2 on its own. Normal exceptions would get caught and result in a real exception message as the job result. This is the fallback for a fallback triggering.
Updated by okurz over 3 years ago
- Target version changed from Ready to future
The main reason I kept this ticket on the backlog was #73288#note-3 but with no action here and the good workaround with automatic retry in place I think we can live with the current state for the time being.