Project

General

Profile

action #73288

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry

Added by okurz 6 months ago. Updated about 2 months ago.

Status:
Workable
Priority:
Low
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2020-10-13
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

from https://openqa.suse.de/tests/4812998

[2020-10-13T10:52:32.0254 CEST] [info] [pid:36934] Download of sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2 processed:
[info] [#43340] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
[info] [#43340] Downloading "sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" from "http://openqa.suse.de/tests/4812991/asset/hdd/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2"
[info] [#43340] Cache size 50GiB + needed 2GiB exceeds limit of 50GiB, purging least used assets
[info] [#43340] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-Build20201013-1-HA-QAM.qcow2" because we need space for new assets, reclaiming 1.1GiB
[info] [#43340] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-Build56.1@64bit-gnome.qcow2" because we need space for new assets, reclaiming 1.6GiB
[info] [#43340] Size of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" is 2GiB, with ETag ""81ad0000-5b183f589249e""
[info] [#43340] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" successful, new cache size is 49GiB

[2020-10-13T10:52:32.0261 CEST] [debug] [pid:36934] Found ISO, caching SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso
[2020-10-13T10:52:32.0263 CEST] [info] [pid:36934] Downloading SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso, request #43343 sent to Cache Service
[2020-10-13T10:52:37.0314 CEST] [info] [pid:36934] Download of SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso processed:
[info] [#43343] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#43343] Downloading "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" from "http://openqa.suse.de/tests/4812998/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
[info] [#43343] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" has not changed, updating last use

[2020-10-13T10:52:37.0318 CEST] [info] [pid:36934] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #43346 sent to Cache Service

with worker-log.txt:

[2020-10-13T10:52:32.0256 CEST] [debug] [pid:36934] Linked asset "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" to "/var/lib/openqa/pool/16/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2"
[2020-10-13T10:52:37.0270 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:37.0271 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:37.0315 CEST] [debug] [pid:36934] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" to "/var/lib/openqa/pool/16/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
[2020-10-13T10:52:42.0326 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:42.0326 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0377 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:47.0378 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0408 CEST] [error] [pid:36934] Unable to setup job 4812998: Cache service status error from API: Minion job #43346 failed: Job terminated unexpectedly (exit code: 2, signal: 0)
[2020-10-13T10:52:47.0408 CEST] [debug] [pid:36934] Stopping job 4812998 from openqa.suse.de: 04812998-sle-15-SP1-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build4.12.14-1063.1.ga210de1-ltp_sched@64bit - reason: setup failure
[2020-10-13T10:52:47.0408 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0467 CEST] [info] [pid:26893] Uploading autoinst-log.txt
[2020-10-13T10:52:47.0502 CEST] [info] [pid:26893] Uploading worker-log.txt

Acceptance criteria

  • AC1: Either job is automatically restarted or the user receives a clear message what the user did wrong

Suggestions

  • Improve log messages
  • Maybe the worker service restarted but this did not retrigger the job. Better restart openQA jobs automatically when this happens

Workaround

Retrigger job


Related issues

Copied from openQA Project - action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retryWorkable2020-10-13

History

#1 Updated by okurz 6 months ago

  • Copied from action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry added

#2 Updated by okurz 6 months ago

  • Description updated (diff)

#3 Updated by kraih 6 months ago

This one is not so harmless. The error message Job terminated unexpectedly (exit code: 2, signal: 0) comes directly from the Minion job queue worker, and means that the process spawned for the download has died unexpectedly. This is very unusual, there is nothing that would make a Minion job process exit with the status code 2 on its own. Normal exceptions would get caught and result in a real exception message as the job result. This is the fallback for a fallback triggering.

#4 Updated by okurz 6 months ago

  • Priority changed from Normal to Low

#5 Updated by okurz 5 months ago

  • Parent task set to #62420

#6 Updated by okurz about 2 months ago

  • Target version changed from Ready to future

The main reason I kept this ticket on the backlog was #73288#note-3 but with no action here and the good workaround with automatic retry in place I think we can live with the current state for the time being.

Also available in: Atom PDF