action #132434: Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #132434

closed

Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M

Added by mkittler almost 2 years ago. Updated almost 2 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2023-07-07

Due date:

2023-07-26

% Done:

Estimated time:

Description

Observation¶

The job https://openqa.suse.de/tests/11527601 ran into timeout: setup exceeded MAX_SETUP_TIME with the following logs:

[2023-07-07T07:29:01.938121+02:00] [info] [pid:5562] +++ setup notes +++
[2023-07-07T07:29:01.938328+02:00] [info] [pid:5562] Running on worker5:15 (Linux 5.14.21-150400.24.66-default #1 SMP PREEMPT_DYNAMIC Tue Jun 6 10:18:38 UTC 2023 (98adc02) x86_64)
[2023-07-07T07:29:01.953155+02:00] [debug] [pid:5562] Found HDD_1, caching publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
[2023-07-07T07:29:01.958611+02:00] [info] [pid:5562] Downloading publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2, request #11899 sent to Cache Service
[2023-07-07T08:29:01.976728+02:00] [info] [pid:5562] +++ worker notes +++
[2023-07-07T08:29:01.977134+02:00] [info] [pid:5562] End time: 2023-07-07 06:29:01
[2023-07-07T08:29:01.977240+02:00] [info] [pid:5562] Result: timeout
[2023-07-07T08:29:01.984928+02:00] [info] [pid:32345] Uploading autoinst-log.txt

The Minion job responsible for the download failed with:

---
args:
- 11527601
- hdd
- publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2023-07-07T05:29:01Z
delayed: 2023-07-07T05:29:01Z
expires: ~
finished: 2023-07-07T05:30:28Z
id: 11899
lax: 0
notes:
  lock: publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2023-07-07T05:29:28Z
state: failed
task: cache_asset
time: 2023-07-07T08:41:32Z
worker: 20

Considering

martchus@worker5:~> sudo systemctl status openqa-worker-cacheservice-minion.service 
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2023-07-07 07:30:29 CEST; 3h 16min ago

this download job resulting in Job terminated unexpectedly is because openqa-worker-cacheservice-minion.service has been restarted.

Acceptance criteria¶

What we observed is likely not completely avoidable but I think we can do the following improvements:

AC1: The cache_asset jobs are avoided to run into Job terminated unexpectedly as much as possible, e.g. by scheduling a retry of the job and terminating immediately when receiving SIGTERM.
AC2: The worker does not wait the whole setup timeout for a download when the corresponding Minion job as failed anyways. The job is instead set to a state/result that will be automatically restarted by the web UI.

Actions

Copy link

Updated by okurz almost 2 years ago

Priority changed from Normal to Urgent
Target version set to Ready

Actions

Copy link

Updated by mkittler almost 2 years ago

Status changed from New to In Progress
Assignee set to mkittler

Actions

Copy link

Updated by mkittler almost 2 years ago

PR for AC1: https://github.com/os-autoinst/openQA/pull/5242

Actions

Copy link

Updated by mkittler almost 2 years ago

PR for AC2: https://github.com/os-autoinst/openQA/pull/5244

Actions

Copy link

Updated by openqa_review almost 2 years ago

Due date set to 2023-07-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

Updated by okurz almost 2 years ago

Subject changed from Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` to Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M

Actions

Copy link

Updated by okurz almost 2 years ago

Status changed from In Progress to Feedback

Minion job cache_asset failed and jobs runs into setup exceeded MAX_SETUP_TIME

Observation¶

The job https://openqa.suse.de/tests/11527601 ran into timeout: setup exceeded MAX_SETUP_TIME with the following logs:

[2023-07-07T07:29:01.938121+02:00] [info] [pid:5562] +++ setup notes +++
[2023-07-07T07:29:01.938328+02:00] [info] [pid:5562] Running on worker5:15 (Linux 5.14.21-150400.24.66-default #1 SMP PREEMPT_DYNAMIC Tue Jun 6 10:18:38 UTC 2023 (98adc02) x86_64)
[2023-07-07T07:29:01.953155+02:00] [debug] [pid:5562] Found HDD_1, caching publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
[2023-07-07T07:29:01.958611+02:00] [info] [pid:5562] Downloading publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2, request #11899 sent to Cache Service
[2023-07-07T08:29:01.976728+02:00] [info] [pid:5562] +++ worker notes +++
[2023-07-07T08:29:01.977134+02:00] [info] [pid:5562] End time: 2023-07-07 06:29:01
[2023-07-07T08:29:01.977240+02:00] [info] [pid:5562] Result: timeout
[2023-07-07T08:29:01.984928+02:00] [info] [pid:32345] Uploading autoinst-log.txt

The Minion job responsible for the download failed with:

---
args:
- 11527601
- hdd
- publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2023-07-07T05:29:01Z
delayed: 2023-07-07T05:29:01Z
expires: ~
finished: 2023-07-07T05:30:28Z
id: 11899
lax: 0
notes:
  lock: publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2023-07-07T05:29:28Z
state: failed
task: cache_asset
time: 2023-07-07T08:41:32Z
worker: 20

Considering

martchus@worker5:~> sudo systemctl status openqa-worker-cacheservice-minion.service 
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2023-07-07 07:30:29 CEST; 3h 16min ago

this download job resulting in Job terminated unexpectedly is because openqa-worker-cacheservice-minion.service has been restarted.

Acceptance criteria¶

What we observed is likely not completely avoidable but I think we can do the following improvements:

AC1: The cache_asset minion jobs are avoided to run into Job terminated unexpectedly as much as possible, e.g. by scheduling a retry of the minion job and terminating immediately when receiving SIGTERM
AC2: The worker does not wait the whole setup timeout for a download when the corresponding Minion job as failed anyways. The job is instead set to a state/result that will be automatically restarted by the web UI.

Actions

Copy link

Updated by mkittler almost 2 years ago

Status changed from Feedback to Resolved

The PRs have been merged. Considering this was hard enough to reproduce locally I wouldn't wait for encountering such a case again in production and resolve the ticket right now.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #132434

Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M

Observation¶

Acceptance criteria¶

Updated by okurz almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by openqa_review almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Observation¶

Acceptance criteria¶

Updated by mkittler almost 2 years ago