action #132434
closed
Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M
Description
Observation
The job https://openqa.suse.de/tests/11527601 ran into the timeout `setup exceeded MAX_SETUP_TIME` with the following logs:
[2023-07-07T07:29:01.938121+02:00] [info] [pid:5562] +++ setup notes +++
[2023-07-07T07:29:01.938328+02:00] [info] [pid:5562] Running on worker5:15 (Linux 5.14.21-150400.24.66-default #1 SMP PREEMPT_DYNAMIC Tue Jun 6 10:18:38 UTC 2023 (98adc02) x86_64)
[2023-07-07T07:29:01.953155+02:00] [debug] [pid:5562] Found HDD_1, caching publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
[2023-07-07T07:29:01.958611+02:00] [info] [pid:5562] Downloading publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2, request #11899 sent to Cache Service
[2023-07-07T08:29:01.976728+02:00] [info] [pid:5562] +++ worker notes +++
[2023-07-07T08:29:01.977134+02:00] [info] [pid:5562] End time: 2023-07-07 06:29:01
[2023-07-07T08:29:01.977240+02:00] [info] [pid:5562] Result: timeout
[2023-07-07T08:29:01.984928+02:00] [info] [pid:32345] Uploading autoinst-log.txt
The Minion job responsible for the download failed with:
---
args:
- 11527601
- hdd
- publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2023-07-07T05:29:01Z
delayed: 2023-07-07T05:29:01Z
expires: ~
finished: 2023-07-07T05:30:28Z
id: 11899
lax: 0
notes:
lock: publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2023-07-07T05:29:28Z
state: failed
task: cache_asset
time: 2023-07-07T08:41:32Z
worker: 20
Considering
martchus@worker5:~> sudo systemctl status openqa-worker-cacheservice-minion.service
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-07-07 07:30:29 CEST; 3h 16min ago
the reason this download job resulted in `Job terminated unexpectedly` is that openqa-worker-cacheservice-minion.service has been restarted: the Minion job finished (failed) at 2023-07-07T05:30:28Z, i.e. 07:30:28 CEST, one second before the service became active again at 07:30:29 CEST.
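To illustrate the failure mode (a minimal Python stand-in; the real cache service worker is Perl/Minion): the download runs for a long time without any SIGTERM handling, so when the service is restarted, systemd sends SIGTERM, the process dies mid-download, the queue only records the job as terminated by signal 15, and nothing retries the download, leaving the openQA job to sit in setup until MAX_SETUP_TIME.

```python
# Sketch only: the asset name is taken from the logs above, everything else is
# illustrative and not the actual cache service code.
import time


def cache_asset(asset):
    # No SIGTERM handler is installed, so a `systemctl restart` of the cache
    # service minion kills this mid-download and the job queue records
    # 'Job terminated unexpectedly (exit code: 0, signal: 15)'.
    time.sleep(3600)  # placeholder for downloading a multi-GB qcow2 image


cache_asset("publiccloud_15-SP5_GCE-BYOS-Updates-x86_64.qcow2")
```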
Acceptance criteria
What we observed is likely not completely avoidable, but I think we can make the following improvements:
- AC1: The `cache_asset` jobs avoid running into `Job terminated unexpectedly` as much as possible, e.g. by scheduling a retry of the job and terminating immediately when receiving SIGTERM (see the sketch after this list).
- AC2: The worker does not wait the whole setup timeout for a download when the corresponding Minion job has failed anyway. The job is instead set to a state/result that will be automatically restarted by the web UI.
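As a rough illustration of the AC1 idea (a Python stand-in, since the actual cache service is Perl/Minion; the queue client and its `retry`/`download` calls are hypothetical names, not the real API): on SIGTERM the job re-schedules itself for a retry and exits right away instead of being killed mid-download.

```python
# Sketch only: hypothetical names, not the openQA/Minion API.
import signal
import sys


class CacheAssetJob:
    """Hypothetical wrapper around a single cache_asset download."""

    def __init__(self, job_id, asset, queue):
        self.job_id = job_id
        self.asset = asset
        self.queue = queue  # hypothetical job-queue client

    def _on_sigterm(self, signum, frame):
        # Re-enqueue the download so it is picked up again after the service
        # restart, then terminate immediately instead of dying mid-transfer.
        self.queue.retry(self.job_id)
        sys.exit(0)

    def run(self):
        signal.signal(signal.SIGTERM, self._on_sigterm)
        self.queue.download(self.asset)  # long-running download
```

With handling along these lines, a restart of openqa-worker-cacheservice-minion.service would leave the download queued for a retry rather than failed with `Job terminated unexpectedly`.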
Updated by okurz over 1 year ago
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler over 1 year ago
PR for AC1: https://github.com/os-autoinst/openQA/pull/5242
Updated by mkittler over 1 year ago
PR for AC2: https://github.com/os-autoinst/openQA/pull/5244
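Roughly, what AC2 amounts to on the worker side (an illustrative Python sketch, not the actual change in the PR; `job_info` and its fields are made-up names): instead of blocking for the full MAX_SETUP_TIME, the worker polls the status of the download's Minion job and gives up as soon as that job has failed, so the openQA job can be marked for an automatic restart.

```python
# Sketch only: cache_client.job_info() and its fields are hypothetical.
import time

MAX_SETUP_TIME = 3600  # seconds, matching the observed one-hour setup timeout
POLL_INTERVAL = 5      # seconds between status checks


def wait_for_asset(cache_client, minion_job_id):
    """Wait for a cached download, but fail fast if its Minion job failed."""
    deadline = time.monotonic() + MAX_SETUP_TIME
    while time.monotonic() < deadline:
        info = cache_client.job_info(minion_job_id)
        if info["state"] == "finished":
            return
        if info["state"] == "failed":
            # No point waiting out MAX_SETUP_TIME for a download that can no
            # longer succeed; the caller marks the openQA job with a
            # state/result the web UI restarts automatically.
            raise RuntimeError(
                f"cache_asset job {minion_job_id} failed: {info['result']}")
        time.sleep(POLL_INTERVAL)
    raise TimeoutError("setup exceeded MAX_SETUP_TIME")
```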
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Subject changed from Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` to Minion job `cache_asset` failed and jobs runs into `setup exceeded MAX_SETUP_TIME` size:M
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
The PRs have been merged. Considering this was hard enough to reproduce locally, I wouldn't wait for such a case to occur again in production, so I'm resolving the ticket right now.