Project

General

Profile

Actions

action #73288

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA

auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry

Added by okurz about 4 years ago. Updated almost 3 years ago.

Status:
Workable
Priority:
Low
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2020-10-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

from https://openqa.suse.de/tests/4812998

[2020-10-13T10:52:32.0254 CEST] [info] [pid:36934] Download of sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2 processed:
[info] [#43340] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
[info] [#43340] Downloading "sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" from "http://openqa.suse.de/tests/4812991/asset/hdd/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2"
[info] [#43340] Cache size 50GiB + needed 2GiB exceeds limit of 50GiB, purging least used assets
[info] [#43340] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-Build20201013-1-HA-QAM.qcow2" because we need space for new assets, reclaiming 1.1GiB
[info] [#43340] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-Build56.1@64bit-gnome.qcow2" because we need space for new assets, reclaiming 1.6GiB
[info] [#43340] Size of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" is 2GiB, with ETag ""81ad0000-5b183f589249e""
[info] [#43340] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" successful, new cache size is 49GiB

[2020-10-13T10:52:32.0261 CEST] [debug] [pid:36934] Found ISO, caching SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso
[2020-10-13T10:52:32.0263 CEST] [info] [pid:36934] Downloading SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso, request #43343 sent to Cache Service
[2020-10-13T10:52:37.0314 CEST] [info] [pid:36934] Download of SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso processed:
[info] [#43343] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#43343] Downloading "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" from "http://openqa.suse.de/tests/4812998/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
[info] [#43343] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" has not changed, updating last use

[2020-10-13T10:52:37.0318 CEST] [info] [pid:36934] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #43346 sent to Cache Service

with worker-log.txt:

[2020-10-13T10:52:32.0256 CEST] [debug] [pid:36934] Linked asset "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2" to "/var/lib/openqa/pool/16/sle-15-SP1-x86_64-4.12.14-1063.1.ga210de1-Server-DVD-Incidents-Kernel-KOTD@64bit-with-ltp.qcow2"
[2020-10-13T10:52:37.0270 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:37.0271 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:37.0315 CEST] [debug] [pid:36934] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" to "/var/lib/openqa/pool/16/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
[2020-10-13T10:52:42.0326 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:42.0326 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0377 CEST] [debug] [pid:36934] Updating status so job 4812998 is not considered dead.
[2020-10-13T10:52:47.0378 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0408 CEST] [error] [pid:36934] Unable to setup job 4812998: Cache service status error from API: Minion job #43346 failed: Job terminated unexpectedly (exit code: 2, signal: 0)
[2020-10-13T10:52:47.0408 CEST] [debug] [pid:36934] Stopping job 4812998 from openqa.suse.de: 04812998-sle-15-SP1-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build4.12.14-1063.1.ga210de1-ltp_sched@64bit - reason: setup failure
[2020-10-13T10:52:47.0408 CEST] [debug] [pid:36934] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4812998/status
[2020-10-13T10:52:47.0467 CEST] [info] [pid:26893] Uploading autoinst-log.txt
[2020-10-13T10:52:47.0502 CEST] [info] [pid:26893] Uploading worker-log.txt

Acceptance criteria

  • AC1: Either job is automatically restarted or the user receives a clear message what the user did wrong

Suggestions

  • Improve log messages
  • Maybe the worker service restarted but this did not retrigger the job. Better restart openQA jobs automatically when this happens

Workaround

Retrigger job


Related issues 1 (1 open0 closed)

Copied from openQA Project - action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retryWorkable2020-10-13

Actions
Actions #1

Updated by okurz about 4 years ago

  • Copied from action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry added
Actions #2

Updated by okurz about 4 years ago

  • Description updated (diff)
Actions #3

Updated by kraih about 4 years ago

This one is not so harmless. The error message Job terminated unexpectedly (exit code: 2, signal: 0) comes directly from the Minion job queue worker, and means that the process spawned for the download has died unexpectedly. This is very unusual, there is nothing that would make a Minion job process exit with the status code 2 on its own. Normal exceptions would get caught and result in a real exception message as the job result. This is the fallback for a fallback triggering.

Actions #4

Updated by okurz about 4 years ago

  • Priority changed from Normal to Low
Actions #5

Updated by okurz almost 4 years ago

  • Parent task set to #62420
Actions #6

Updated by okurz over 3 years ago

  • Target version changed from Ready to future

The main reason I kept this ticket on the backlog was #73288#note-3 but with no action here and the good workaround with automatic retry in place I think we can live with the current state for the time being.

Actions #7

Updated by okurz almost 3 years ago

  • Parent task changed from #62420 to #102909
Actions

Also available in: Atom PDF