action #73282

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA

auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry

Added by okurz almost 4 years ago. Updated almost 3 years ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2020-10-13
Due date:
% Done: 0%
Estimated time:
Description

Observation

from https://openqa.suse.de/tests/4811089

[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0147 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:06.0148 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0208 CEST] [error] [pid:36916] Unable to setup job 4811089: Cache service status error from API: Minion job #42464 failed: Worker went away
[2020-10-12T16:22:06.0208 CEST] [debug] [pid:36916] Stopping job 4811089 from openqa.suse.de: 04811089-sle-12-SP4-Server-DVD-Incidents-x86_64-Build:16743:php72-mau-webserver@64bit - reason: setup failure
[2020-10-12T16:22:06.0209 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0272 CEST] [info] [pid:62439] Uploading autoinst-log.txt
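
To spot this failure quickly in a worker log, the excerpt above can be parsed with a small script. This is only a sketch based on the line layout visible in the excerpt (`[timestamp] [level] [pid:N] message`); the function name `find_setup_failure` and the returned field names are made up for illustration and are not part of openQA.

```python
import re

# Layout of the worker log lines quoted in this ticket:
# [timestamp] [level] [pid:N] message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<level>\w+)\] \[pid:(?P<pid>\d+)\] (?P<message>.*)$"
)

# Shape of the setup failure message seen in this ticket.
SETUP_FAILURE = re.compile(
    r"Unable to setup job (?P<job_id>\d+): Cache service status error from API: "
    r"Minion job #(?P<minion_id>\d+) failed: (?P<cause>.+)$"
)


def find_setup_failure(log_text):
    """Return details of the first cache service setup failure, or None."""
    for line in log_text.splitlines():
        parsed = LOG_LINE.match(line)
        if parsed is None or parsed["level"] != "error":
            continue
        failure = SETUP_FAILURE.search(parsed["message"])
        if failure:
            return {
                "timestamp": parsed["timestamp"],
                "job_id": int(failure["job_id"]),
                "minion_id": int(failure["minion_id"]),
                "cause": failure["cause"],
            }
    return None


# Two lines copied from the observation above.
sample = (
    "[2020-10-12T16:22:06.0148 CEST] [debug] [pid:36916] REST-API call: "
    "POST http://openqa.suse.de/api/v1/jobs/4811089/status\n"
    "[2020-10-12T16:22:06.0208 CEST] [error] [pid:36916] Unable to setup job 4811089: "
    "Cache service status error from API: Minion job #42464 failed: Worker went away\n"
)

print(find_setup_failure(sample))
```

The extracted timestamp is the useful part when following up on a report like this one, since it tells an admin which window of the journal to check for service restarts.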

Acceptance criteria

  • AC1: Either the job is automatically restarted or the user receives a clear message explaining what they did wrong

Suggestions

  • Improve log messages
  • The worker service probably restarted without retriggering the job. openQA jobs should be restarted automatically when this happens

Workaround

Retrigger job
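
The subject of this ticket appears to follow the pattern `auto_review:"<regex>":retry`. The sketch below shows how such a subject could be split into a regex and an action and checked against a job's failure reason. It is an illustration only, not the way openQA tooling actually evaluates these subjects. The helpers `parse_subject` and `should_retry` are hypothetical, and the `reason` string is an assumption assembled from the "reason: setup failure" and error lines in the log above.

```python
import re

# Subject of this ticket, copied verbatim.
SUBJECT = (
    'auto_review:"setup failure: Cache service status error from API: '
    'Minion job.*Worker went away":retry'
)

# Assumed subject layout: auto_review:"<regex>" with an optional :<action> suffix.
SUBJECT_FORMAT = re.compile(r'^auto_review:"(?P<regex>.+)"(?::(?P<action>\w+))?$')


def parse_subject(subject):
    """Split a subject into (compiled regex, action), or return None."""
    match = SUBJECT_FORMAT.match(subject)
    if match is None:
        return None
    return re.compile(match["regex"]), match["action"]


def should_retry(subject, reason):
    """True if the subject requests a retry and its regex matches the reason."""
    parsed = parse_subject(subject)
    if parsed is None:
        return False
    pattern, action = parsed
    return action == "retry" and pattern.search(reason) is not None


# Assumed reason text, combining the two relevant log lines from the observation.
reason = (
    "setup failure: Cache service status error from API: "
    "Minion job #42464 failed: Worker went away"
)

print(should_retry(SUBJECT, reason))
```

A reason ending in "Job terminated unexpectedly" would not match this subject; that case is tracked separately in action #73288.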


Related issues 1 (1 open, 0 closed)

Copied to openQA Project - action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry (Workable, 2020-10-13)

Actions #1

Updated by okurz almost 4 years ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Feature requests
  • Target version set to Ready
Actions #2

Updated by okurz almost 4 years ago

  • Copied to action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry added
Actions #3

Updated by okurz almost 4 years ago

  • Tags set to worker, cache, minion, download
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz almost 4 years ago

  • Priority changed from Normal to Low
Actions #5

Updated by okurz almost 4 years ago

  • Target version changed from Ready to future
Actions #6

Updated by kraih almost 4 years ago

I don't think a user error could trigger this, and seeing this error in the worker log should be extremely rare. In fact, I can think of only one scenario that could trigger it since the Minion 10.08 release in July: the Minion worker process must have been killed right before the Minion job process downloading the asset was also killed, as in an unclean restart with SIGKILL. On service restart, the Minion worker detects this and fails the Minion job with "Worker went away". An admin on the machine must have had a hand in this.

I also looked at the machine, and it currently has the latest version of perl-Minion installed (10.14). Unfortunately it's a busy machine, so the journal no longer contained anything from the day in question, and I couldn't check for service restart timestamps.

My recommendation (at least for now) is for someone from the team to investigate promptly if this happens again, before the relevant journal entries are gone.

Actions #7

Updated by okurz almost 4 years ago

  • Parent task set to #62420
Actions #8

Updated by okurz almost 3 years ago

  • Parent task changed from #62420 to #102909