action #167797
closed coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M
Description
Observation
https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3187771
shows https://openqa.opensuse.org/tests/4532764 incomplete which shows
Reason: cache failure: Cache service status error from API: Minion job #357644 failed: Job terminated unexpectedly (exit code: 0, signal: 15) [Auto-restarting because reason matches the configured "auto_clone_regex".]
The clone https://openqa.opensuse.org/tests/4532766 was fine, so auto-cloning helped, but we should look into why the minion job failed and also think about whether we can improve things so that a tracked openQA job wouldn't fail.
Acceptance criteria
- AC1: Given openQA jobs are running And cache service requests are still pending When the cache service is requested to be terminated Then openQA jobs do not incomplete
Suggestions
- Use SignalGuard like here: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Job/FinalizeResults.pm#L15 in the affected task
- Regarding the number "Minion job #357644": that number is actually the minion job id of the cache service minion server running on the client with an SQLite database, and the actual task is cache_tests, not git_clone, so it's https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/CacheService/Task/Sync.pm
Updated by tinita 2 months ago · Edited
Not sure about the number #357644 in the message, but this seems to be the corresponding minion job:
https://openqa.opensuse.org/minion/jobs?id=4386796
It looks like the gru service was restarted around that time:
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
journalctl -u openqa-gru --since '2024-10-04'
...
Oct 04 11:50:43 ariel systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Oct 04 11:50:43 ariel openqa-gru[10607]: timeout: sending signal TERM to command ‘env’
Oct 04 11:50:43 ariel openqa-gru[10009]: timeout: sending signal TERM to command ‘env’
Oct 04 11:50:49 ariel systemd[1]: openqa-gru.service: Deactivated successfully.
Oct 04 11:50:49 ariel systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Oct 04 11:50:49 ariel systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
From /var/log/openqa_gru:
[2024-10-04T11:50:38.592815Z] [debug] [#4386795] Updating '/var/lib/openqa/share/tests/opensuse' to 'https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#a987264cb87a9982098c4797530d3ac2265a6c79'
[2024-10-04T11:50:38.594292Z] [info] [pid:10594] Running cmd: git -C /var/lib/openqa/share/tests/opensuse remote get-url origin
[2024-10-04T11:50:38.629396Z] [debug] [pid:10594] https://github.com/os-autoinst/os-autoinst-distri-opensuse.git
[2024-10-04T11:50:38.629508Z] [debug] [pid:10594]
[2024-10-04T11:50:38.629560Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:38.630894Z] [info] [pid:10594] Running cmd: git -C /var/lib/openqa/share/tests/opensuse diff-index HEAD --exit-code
[2024-10-04T11:50:38.694097Z] [debug] [pid:10594]
[2024-10-04T11:50:38.694321Z] [debug] [pid:10594]
[2024-10-04T11:50:38.694385Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:38.694670Z] [info] [pid:10594] Running cmd: git -C /var/lib/openqa/share/tests/opensuse branch --show-current
[2024-10-04T11:50:38.719500Z] [debug] [pid:10594] master
[2024-10-04T11:50:38.719602Z] [debug] [pid:10594]
[2024-10-04T11:50:38.719647Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:38.719891Z] [info] [pid:10594] Running cmd: env GIT_SSH_COMMAND=ssh -oBatchMode=yes git -C /var/lib/openqa/share/tests/opensuse fetch origin a987264cb87a9982098c4797530d3ac2265a6c79
[2024-10-04T11:50:39.393781Z] [debug] [pid:10594]
[2024-10-04T11:50:39.393989Z] [debug] [pid:10594] From https://github.com/os-autoinst/os-autoinst-distri-opensuse
* branch a987264cb87a9982098c4797530d3ac2265a6c79 -> FETCH_HEAD
[2024-10-04T11:50:39.394066Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:42.752460Z] [debug] Process 11210 is performing job "4386796" with task "git_clone"
[2024-10-04T11:50:42.795974Z] [debug] [#4386796] Updating '/var/lib/openqa/share/tests/opensuse' to 'https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#a987264cb87a9982098c4797530d3ac2265a6c79'
[2024-10-04T11:50:42.797282Z] [info] [pid:11210] Running cmd: git -C /var/lib/openqa/share/tests/opensuse remote get-url origin
[2024-10-04T11:50:42.846267Z] [debug] [pid:11210] https://github.com/os-autoinst/os-autoinst-distri-opensuse.git
[2024-10-04T11:50:42.846395Z] [debug] [pid:11210]
[2024-10-04T11:50:42.846447Z] [info] [pid:11210] cmd returned 0
[2024-10-04T11:50:42.855957Z] [info] [pid:11210] Running cmd: git -C /var/lib/openqa/share/tests/opensuse diff-index HEAD --exit-code
[2024-10-04T11:50:42.914095Z] [debug] [pid:11210]
[2024-10-04T11:50:42.914186Z] [debug] [pid:11210]
[2024-10-04T11:50:42.914233Z] [info] [pid:11210] cmd returned 0
[2024-10-04T11:50:42.915033Z] [info] [pid:11210] Running cmd: git -C /var/lib/openqa/share/tests/opensuse branch --show-current
[2024-10-04T11:50:42.937050Z] [debug] [pid:11210] master
[2024-10-04T11:50:42.937144Z] [debug] [pid:11210]
[2024-10-04T11:50:42.937205Z] [info] [pid:11210] cmd returned 0
[2024-10-04T11:50:42.937366Z] [info] [pid:11210] Running cmd: env GIT_SSH_COMMAND=ssh -oBatchMode=yes git -C /var/lib/openqa/share/tests/opensuse fetch origin a987264cb87a9982098c4797530d3ac2265a6c79
# <----------------- note: no "cmd returned ..." here
[2024-10-04T11:50:48.782078Z] [info] Worker 30673 stopped
Maybe that should be handled in a better way?
Updated by tinita 2 months ago
- Description updated (diff)
We already use a SignalGuard elsewhere: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Job/FinalizeResults.pm#L15
my $ensure_task_retry_on_termination_signal_guard = OpenQA::Task::SignalGuard->new($minion_job);
It should be easy to add that to the git_clone task as well; a minimal sketch follows below.
We just need to be sure that it's OK to abort the series of git commands at any point without leaving the repository in a broken state, but I think that's the case here.
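A minimal sketch of what that could look like, assuming the task module is lib/OpenQA/Task/Git/Clone.pm and that its task function receives the Minion job as its first argument (the module name and function signature are assumptions, mirroring the FinalizeResults.pm example above):
package OpenQA::Task::Git::Clone;
use Mojo::Base 'Mojolicious::Plugin', -signatures;
use OpenQA::Task::SignalGuard;

sub register ($self, $app, @) {
    $app->minion->add_task(git_clone => \&_git_clone_all);
}

sub _git_clone_all ($job, @args) {
    # Retry the Minion job instead of failing it if SIGTERM arrives while
    # the series of git commands is still running
    my $ensure_task_retry_on_termination_signal_guard = OpenQA::Task::SignalGuard->new($job);
    # ... existing git commands (remote get-url, diff-index, fetch, ...) ...
}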
Updated by tinita 2 months ago
- Related to action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M added
Updated by tinita 2 months ago
- Related to action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download added
Updated by tinita 2 months ago
- Related to action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule added
Updated by tinita 2 months ago
- Related to action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3 added
Updated by tinita 2 months ago
- Related to deleted (action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3)
Updated by tinita 2 months ago
- Has duplicate action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3 added
Updated by mkittler 2 months ago · Edited
I tried to reproduce this locally but couldn't. If I restart the Minion service while a download job is active (or explicitly send SIGTERM to the active job via the Minion dashboard), the job goes back into the inactive state. It is retried, and even if I repeat this multiple times it eventually finishes and the openQA job can run. Looking at the code, it also looks like the signal guard we want to use is already in place, which explains why it generally works.
The job we saw failing in production was maybe really caused by the signal arriving right before the signal guard was registered. Not sure how to fix that; we'd probably need an upstream Minion change for that. Such a change would allow us to add a SIGTERM handler before the actual job runs, in some pre-run callback. If SIGTERM were received while the pre-run callback is executing, the Minion framework would just shut down and not execute the job (so the job never leaves the inactive state). Maybe using the start event for the signal guard registration would actually do the trick? A sketch of that idea follows below.
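A minimal sketch of that last idea, under the assumption that we can hook into the worker setup somewhere (Minion::Worker emits a dequeue event in the worker process, and Minion::Job emits a start event in the forked job process before the task code runs):
use Mojo::Base -strict, -signatures;

# Hypothetical: install a TERM handler as early as possible in the forked
# job process, before the task code (and its own SignalGuard) runs.
# $worker is the Minion::Worker instance; having access to it here is an
# assumption about where this hook would live.
$worker->on(dequeue => sub ($worker, $job) {
    $job->on(start => sub ($job) {
        # Put the job back into the inactive state instead of failing it
        # if we are terminated before the task's own signal guard takes over
        $SIG{TERM} = sub { $job->retry({delay => 30}); exit 0 };
    });
});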