action #73282
coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA
auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry
0%
Description
Observation
from https://openqa.suse.de/tests/4811089
[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:01.0085 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0147 CEST] [debug] [pid:36916] Updating status so job 4811089 is not considered dead.
[2020-10-12T16:22:06.0148 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0208 CEST] [error] [pid:36916] Unable to setup job 4811089: Cache service status error from API: Minion job #42464 failed: Worker went away
[2020-10-12T16:22:06.0208 CEST] [debug] [pid:36916] Stopping job 4811089 from openqa.suse.de: 04811089-sle-12-SP4-Server-DVD-Incidents-x86_64-Build:16743:php72-mau-webserver@64bit - reason: setup failure
[2020-10-12T16:22:06.0209 CEST] [debug] [pid:36916] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811089/status
[2020-10-12T16:22:06.0272 CEST] [info] [pid:62439] Uploading autoinst-log.txt
Acceptance criteria
- AC1: Either the job is automatically restarted or the user receives a clear message explaining what they did wrong
Suggestions
- Improve log messages
- The worker service probably restarted without retriggering the job. openQA jobs should be restarted automatically when this happens
Workaround
Retrigger job
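As a rough illustration of how the `:retry` pattern from the ticket subject could drive an automatic retrigger, the sketch below applies that regular expression to a failure reason. The `should_retry` helper and the exact shape of the reason string are assumptions made for illustration only. The reason is assembled from the "reason: setup failure" and "Unable to setup job" log lines above, and this is not the actual auto-review tooling.

```python
import re

# Pattern copied from the ticket subject (the text between the quotes).
AUTO_REVIEW_PATTERN = re.compile(
    r"setup failure: Cache service status error from API: "
    r"Minion job.*Worker went away"
)


def should_retry(reason: str) -> bool:
    """Hypothetical helper: True if the reason matches the auto_review pattern."""
    return AUTO_REVIEW_PATTERN.search(reason) is not None


# Assumed reason string, pieced together from the log excerpt in this ticket.
reason = (
    "setup failure: Cache service status error from API: "
    "Minion job #42464 failed: Worker went away"
)

print(should_retry(reason))
# A different Minion failure (tracked separately in #73288) does not match.
print(should_retry(
    "setup failure: Cache service status error from API: "
    "Minion job #42464 failed: Job terminated unexpectedly"
))
```

The `.*` in the pattern absorbs the varying Minion job number, so the same pattern applies to any job that fails this way.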
Updated by okurz almost 4 years ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to Feature requests
- Target version set to Ready
Updated by okurz almost 4 years ago
- Copied to action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry added
Updated by okurz almost 4 years ago
- Tags set to worker, cache, minion, download
- Description updated (diff)
- Status changed from New to Workable
Updated by kraih almost 4 years ago
I don't think a user error could trigger this, and seeing this error in the worker log should be extremely rare. In fact, I can only think of one scenario that could trigger it since the Minion 10.08 release in July: the Minion worker process must have been killed right before the Minion job process downloading the asset was also killed, as in an unclean restart with SIGKILL. On service restart the Minion worker detects this and fails the Minion job with "Worker went away". An admin on the machine must have had a hand in this.
I also looked at the machine, and it currently has the latest version of perl-Minion installed (10.14). Unfortunately it's a busy machine, so nothing was left in the journal from the day in question, and I couldn't check for service restart timestamps.
My recommendation (at least for now) would be for someone from the team to investigate again, but sooner after the failure, if this happens again.