Project

General

Profile

Actions

action #132434

closed

Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M

Added by mkittler over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-07-07
Due date:
2023-07-26
% Done:

0%

Estimated time:

Description

Observation

The job https://openqa.suse.de/tests/11527601 ran into timeout: setup exceeded MAX_SETUP_TIME with the following logs:

[2023-07-07T07:29:01.938121+02:00] [info] [pid:5562] +++ setup notes +++
[2023-07-07T07:29:01.938328+02:00] [info] [pid:5562] Running on worker5:15 (Linux 5.14.21-150400.24.66-default #1 SMP PREEMPT_DYNAMIC Tue Jun 6 10:18:38 UTC 2023 (98adc02) x86_64)
[2023-07-07T07:29:01.953155+02:00] [debug] [pid:5562] Found HDD_1, caching publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
[2023-07-07T07:29:01.958611+02:00] [info] [pid:5562] Downloading publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2, request #11899 sent to Cache Service
[2023-07-07T08:29:01.976728+02:00] [info] [pid:5562] +++ worker notes +++
[2023-07-07T08:29:01.977134+02:00] [info] [pid:5562] End time: 2023-07-07 06:29:01
[2023-07-07T08:29:01.977240+02:00] [info] [pid:5562] Result: timeout
[2023-07-07T08:29:01.984928+02:00] [info] [pid:32345] Uploading autoinst-log.txt

The Minion job responsible for the download failed with:

---
args:
- 11527601
- hdd
- publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2023-07-07T05:29:01Z
delayed: 2023-07-07T05:29:01Z
expires: ~
finished: 2023-07-07T05:30:28Z
id: 11899
lax: 0
notes:
  lock: publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2023-07-07T05:29:28Z
state: failed
task: cache_asset
time: 2023-07-07T08:41:32Z
worker: 20

Considering

martchus@worker5:~> sudo systemctl status openqa-worker-cacheservice-minion.service 
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2023-07-07 07:30:29 CEST; 3h 16min ago

this download job resulting in Job terminated unexpectedly is because openqa-worker-cacheservice-minion.service has been restarted.

Acceptance criteria

What we observed is likely not completely avoidable but I think we can do the following improvements:

  • AC1: The cache_asset jobs are avoided to run into Job terminated unexpectedly as much as possible, e.g. by scheduling a retry of the job and terminating immediately when receiving SIGTERM.
  • AC2: The worker does not wait the whole setup timeout for a download when the corresponding Minion job as failed anyways. The job is instead set to a state/result that will be automatically restarted by the web UI.
Actions #1

Updated by okurz over 1 year ago

  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #5

Updated by openqa_review over 1 year ago

  • Due date set to 2023-07-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz over 1 year ago

  • Subject changed from Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` to Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M
Actions #7

Updated by okurz over 1 year ago

  • Status changed from In Progress to Feedback

Minion job cache_asset failed and jobs runs into setup exceeded MAX_SETUP_TIME

Observation

The job https://openqa.suse.de/tests/11527601 ran into timeout: setup exceeded MAX_SETUP_TIME with the following logs:

[2023-07-07T07:29:01.938121+02:00] [info] [pid:5562] +++ setup notes +++
[2023-07-07T07:29:01.938328+02:00] [info] [pid:5562] Running on worker5:15 (Linux 5.14.21-150400.24.66-default #1 SMP PREEMPT_DYNAMIC Tue Jun 6 10:18:38 UTC 2023 (98adc02) x86_64)
[2023-07-07T07:29:01.953155+02:00] [debug] [pid:5562] Found HDD_1, caching publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
[2023-07-07T07:29:01.958611+02:00] [info] [pid:5562] Downloading publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2, request #11899 sent to Cache Service
[2023-07-07T08:29:01.976728+02:00] [info] [pid:5562] +++ worker notes +++
[2023-07-07T08:29:01.977134+02:00] [info] [pid:5562] End time: 2023-07-07 06:29:01
[2023-07-07T08:29:01.977240+02:00] [info] [pid:5562] Result: timeout
[2023-07-07T08:29:01.984928+02:00] [info] [pid:32345] Uploading autoinst-log.txt

The Minion job responsible for the download failed with:

---
args:
- 11527601
- hdd
- publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2023-07-07T05:29:01Z
delayed: 2023-07-07T05:29:01Z
expires: ~
finished: 2023-07-07T05:30:28Z
id: 11899
lax: 0
notes:
  lock: publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2023-07-07T05:29:28Z
state: failed
task: cache_asset
time: 2023-07-07T08:41:32Z
worker: 20

Considering

martchus@worker5:~> sudo systemctl status openqa-worker-cacheservice-minion.service 
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2023-07-07 07:30:29 CEST; 3h 16min ago

this download job resulting in Job terminated unexpectedly is because openqa-worker-cacheservice-minion.service has been restarted.

Acceptance criteria

What we observed is likely not completely avoidable but I think we can do the following improvements:

  • AC1: The cache_asset minion jobs are avoided to run into Job terminated unexpectedly as much as possible, e.g. by scheduling a retry of the minion job and terminating immediately when receiving SIGTERM
  • AC2: The worker does not wait the whole setup timeout for a download when the corresponding Minion job as failed anyways. The job is instead set to a state/result that will be automatically restarted by the web UI.
Actions #8

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

The PRs have been merged. Considering this was hard enough to reproduce locally I wouldn't wait for encountering such a case again in production and resolve the ticket right now.

Actions

Also available in: Atom PDF