action #73282

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry

Added by okurz 6 months ago. Updated 5 months ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2020-10-13
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

from https://openqa.suse.de/tests/4811089

[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0147 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:06.0148 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0208 CEST] [error] [pid:36916] Unable to setup job 4811089: Cache service status error from API: Minion job #42464 failed: Worker went away
[2020-10-12T16:22:06.0208 CEST] [debug] [pid:36916] Stopping job 4811089 from openqa.suse.de: 04811089-sle-12-SP4-Server-DVD-Incidents-x86_64-Build:16743:php72-mau-webserver@64bit - reason: setup failure
[2020-10-12T16:22:06.0209 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0272 CEST] [info] [pid:62439] Uploading autoinst-log.txt
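
The `auto_review` pattern in the ticket title can be checked against this failure with a short script. This is a minimal sketch in Python. It assumes the job's reason is the stop reason (`setup failure`) joined to the error text with `: `. That joined string and the helper name `matches_auto_review` are illustrative assumptions, not part of openQA.

```python
import re

# Pattern copied from the ticket title (the part between the quotes).
AUTO_REVIEW_PATTERN = re.compile(
    r"setup failure: Cache service status error from API: "
    r"Minion job.*Worker went away"
)


def matches_auto_review(reason: str) -> bool:
    """Return True if the job reason matches the ticket's pattern."""
    return AUTO_REVIEW_PATTERN.search(reason) is not None


# Assumed shape of the reason: stop reason + ": " + error text from the log.
reason = (
    "setup failure: Cache service status error from API: "
    "Minion job #42464 failed: Worker went away"
)
print(matches_auto_review(reason))
```

The `.*` in the pattern absorbs the Minion job number, so the same pattern should cover other jobs that fail in the same way.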

Acceptance criteria

  • AC1: Either the job is automatically restarted, or the user receives a clear message explaining what they did wrong

Suggestions

  • Improve log messages
  • The worker service probably restarted, but this did not retrigger the job. It would be better to restart openQA jobs automatically when this happens

Workaround

Retrigger job
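
An automated version of this workaround could look roughly like the following Python sketch. It assumes the openQA REST API exposes a `jobs/<id>/restart` route next to the `jobs/<id>/status` route seen in the log. It only builds the request and leaves out authentication and the actual HTTP call. The helper name `restart_request` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern copied from the ticket title (the part between the quotes).
RETRY_PATTERN = re.compile(
    r"setup failure: Cache service status error from API: "
    r"Minion job.*Worker went away"
)


def restart_request(
    base_url: str, job_id: int, reason: Optional[str]
) -> Optional[Tuple[str, str]]:
    """Return (method, url) for retriggering the job, or None.

    None means the reason does not match the known failure, so the job
    should be left for manual review instead of being retried.
    """
    if reason is None or not RETRY_PATTERN.search(reason):
        return None
    # Assumed route; authentication headers are intentionally not shown.
    url = f"{base_url.rstrip('/')}/api/v1/jobs/{job_id}/restart"
    return ("POST", url)


print(
    restart_request(
        "https://openqa.suse.de",
        4811089,
        "setup failure: Cache service status error from API: "
        "Minion job #42464 failed: Worker went away",
    )
)
```

Keeping the decision separate from the HTTP call makes it possible to check which jobs would be retried before anything is sent.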


Related issues

Copied to openQA Project - action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry (Workable, 2020-10-13)

History

#1 Updated by okurz 6 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Feature requests
  • Target version set to Ready

#2 Updated by okurz 6 months ago

  • Copied to action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry added

#3 Updated by okurz 6 months ago

  • Tags set to worker, cache, minion, download
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by okurz 6 months ago

  • Priority changed from Normal to Low

#5 Updated by okurz 5 months ago

  • Target version changed from Ready to future

#6 Updated by kraih 5 months ago

I don't think a user error could trigger this, and seeing this error in the worker log should be extremely rare. In fact, I can only think of one scenario that would trigger it since the Minion 10.08 release in July: the Minion worker process must have been killed right before the Minion job process downloading the asset was also killed, as in an unclean restart with SIGKILL. On service restart the Minion worker detects this and fails the Minion job with "Worker went away". An admin on the machine must have had a hand in this.
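
As a rough illustration of the scenario described above, the following Python sketch simulates the detection step. It is not Minion's actual code. The `MinionJob` class, the `fail_orphaned_jobs` helper, and job #42465 are hypothetical. The sketch only assumes that an active job whose worker is no longer registered gets failed with the result "Worker went away".

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class MinionJob:
    id: int
    state: str  # e.g. "active" or "failed"
    worker_id: Optional[int] = None
    result: Optional[str] = None


def fail_orphaned_jobs(
    jobs: List[MinionJob], live_worker_ids: Set[int]
) -> List[MinionJob]:
    """Fail active jobs whose worker is no longer registered."""
    failed = []
    for job in jobs:
        if job.state == "active" and job.worker_id not in live_worker_ids:
            job.state = "failed"
            job.result = "Worker went away"
            failed.append(job)
    return failed


jobs = [
    MinionJob(id=42464, state="active", worker_id=1),  # worker 1 was killed
    MinionJob(id=42465, state="active", worker_id=2),  # hypothetical job, worker alive
]
for job in fail_orphaned_jobs(jobs, live_worker_ids={2}):
    print(f"Minion job #{job.id} failed: {job.result}")
```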

I also looked at the machine, and it currently has the latest version of perl-Minion installed (10.14). Unfortunately it's a busy machine, so there was nothing left in the journal from the day in question, and I couldn't check for service restart timestamps.

My recommendation (at least for now) would be for someone from the team to investigate again, and more quickly, if this happens again.

#7 Updated by okurz 5 months ago

  • Parent task set to #62420
