action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry - openQA Project (public) - openSUSE Project Management Tool

action #73282

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA

auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry

Added by okurz over 4 years ago. Updated over 3 years ago.

Status:

Workable

Priority:

Low

Assignee:

Category:

Feature requests

Target version:

QA (public) - future

Start date:

2020-10-13

Due date:

% Done:

Estimated time:

Tags:

worker, minion, cache, download

Description

Observation

from https://openqa.suse.de/tests/4811089

[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0147 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:06.0148 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0208 CEST] [error] [pid:36916] Unable to setup job 4811089: Cache service status error from API: Minion job #42464 failed: Worker went away
[2020-10-12T16:22:06.0208 CEST] [debug] [pid:36916] Stopping job 4811089 from openqa.suse.de: 04811089-sle-12-SP4-Server-DVD-Incidents-x86_64-Build:16743:php72-mau-webserver@64bit - reason: setup failure
[2020-10-12T16:22:06.0209 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0272 CEST] [info] [pid:62439] Uploading autoinst-log.txt
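
When triaging similar incompletes it can help to pull the openQA job id, the Minion job id and the failure cause out of the error line. The following is a minimal sketch, assuming the line format shown in the log above; the function name and group names are hypothetical and not part of openQA.

```python
import re

# Pattern for the "[error]" line in the worker log above.
# The group names are illustrative only; they are not openQA identifiers.
SETUP_ERROR = re.compile(
    r"Unable to setup job (?P<job_id>\d+): "
    r"Cache service status error from API: "
    r"Minion job #(?P<minion_id>\d+) failed: (?P<cause>.+)$"
)


def parse_setup_error(line):
    """Return job id, Minion job id and cause, or None if the line does not match."""
    match = SETUP_ERROR.search(line)
    if match is None:
        return None
    return {
        "job_id": int(match.group("job_id")),
        "minion_id": int(match.group("minion_id")),
        "cause": match.group("cause"),
    }


# Error line copied from the log above.
line = (
    "[2020-10-12T16:22:06.0208 CEST] [error] [pid:36916] Unable to setup job 4811089: "
    "Cache service status error from API: Minion job #42464 failed: Worker went away"
)
print(parse_setup_error(line))
```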

Acceptance criteria

AC1: Either the job is automatically restarted or the user receives a clear message explaining what they did wrong

Suggestions

Improve the log messages
The worker service probably restarted without retriggering the job. Restart openQA jobs automatically when this happens
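
The automatic restart suggested here is what the ":retry" suffix in the ticket subject asks for. As a rough sketch of that decision, and not the actual openQA auto_review implementation, the code below extracts the regex from the subject and matches it against a job reason. The function names are hypothetical, and the reason string is an assumption built from "setup failure" plus the error text in the log.

```python
import re

# Subject of this ticket, copied verbatim.
SUBJECT = (
    'auto_review:"setup failure: Cache service status error from API: '
    'Minion job.*Worker went away":retry'
)


def parse_auto_review(subject):
    """Split an auto_review subject into (regex, wants_retry), or None if absent."""
    match = re.search(r'auto_review:"(?P<regex>.+)"(?P<retry>:retry)?', subject)
    if match is None:
        return None
    return match.group("regex"), match.group("retry") is not None


def should_retry(subject, reason):
    """True if the subject requests a retry and its regex matches the job reason."""
    parsed = parse_auto_review(subject)
    if parsed is None:
        return False
    regex, wants_retry = parsed
    return wants_retry and re.search(regex, reason) is not None


# Assumed reason string: "setup failure" plus the error text from the log.
reason = (
    "setup failure: Cache service status error from API: "
    "Minion job #42464 failed: Worker went away"
)
print(should_retry(SUBJECT, reason))
```

A reason ending in "Job terminated unexpectedly", the case tracked in the copied ticket #73288, does not match this subject's regex and so would not be retried by this check.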

Workaround

Retrigger the job

Related issues 1 (1 open — 0 closed)

Updated by okurz over 4 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category set to Feature requests
Target version set to Ready

Updated by okurz over 4 years ago

Copied to action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry added

Updated by okurz over 4 years ago

Tags set to worker, cache, minion, download
Description updated
Status changed from New to Workable

Updated by okurz over 4 years ago

Priority changed from Normal to Low

Updated by okurz over 4 years ago

Target version changed from Ready to future

Updated by kraih over 4 years ago

I don't think a user error could trigger this, and this error should be extremely rare in the worker log. In fact, I can think of only one scenario that triggers it since the Minion 10.08 release in July: the Minion worker process must have been killed right before the Minion job process downloading the asset was also killed, as in an unclean restart with SIGKILL. On service restart the Minion worker detects this and fails the Minion job with "Worker went away". An admin on the machine must have had a hand in this.

I also looked at the machine, and it currently has the latest version of perl-Minion installed (10.14). Unfortunately it's a busy machine, so nothing from the day in question was left in the journal, and I couldn't check for service restart timestamps.

My recommendation (at least for now) would be for someone from the team to investigate promptly if this happens again, before the journal entries are gone.

Updated by okurz over 4 years ago

Parent task set to #62420

Updated by okurz over 3 years ago

Parent task changed from #62420 to #102909
