Project

General

Profile

Actions

action #167797

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M

Added by okurz 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-10-04
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3187771
shows https://openqa.opensuse.org/tests/4532764 incomplete which shows

Reason: cache failure: Cache service status error from API: Minion job #357644 failed: Job terminated unexpectedly (exit code: 0, signal: 15) [Auto-restarting because reason matches the configured "auto_clone_regex".] 

The clone https://openqa.opensuse.org/tests/4532766 was fine so auto-cloning helped but we should look into why the minion job failed and also think about if we can improve so that a tracked openQA job wouldn't fail.

Acceptance criteria

  • AC1: Given openQA jobs are running And cache service requests are still pending When the cache service is requested to be terminated Then openQA jobs do not incomplete

Suggestions


Related issues 4 (2 open2 closed)

Related to openQA Project (public) - action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:MResolvedmkittler2021-12-02

Actions
Related to openQA Project (public) - action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::DownloadNew2022-03-25

Actions
Related to openQA Project (public) - action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::ScheduleNew2022-03-25

Actions
Has duplicate openQA Infrastructure (public) - action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3Rejected2024-10-08

Actions
Actions #1

Updated by tinita 2 months ago · Edited

Not sure about the number #357644 in the message but this seems to be the according minion job:
https://openqa.opensuse.org/minion/jobs?id=4386796

It looks like the gru service was restarted around that time:

result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
journalctl -u openqa-gru --since '2024-10-04'
...
Oct 04 11:50:43 ariel systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Oct 04 11:50:43 ariel openqa-gru[10607]: timeout: sending signal TERM to command ‘env’
Oct 04 11:50:43 ariel openqa-gru[10009]: timeout: sending signal TERM to command ‘env’
Oct 04 11:50:49 ariel systemd[1]: openqa-gru.service: Deactivated successfully.
Oct 04 11:50:49 ariel systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Oct 04 11:50:49 ariel systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
/var/log/openqa_gru
[2024-10-04T11:50:38.592815Z] [debug] [#4386795] Updating '/var/lib/openqa/share/tests/opensuse' to 'https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#a987264cb87a9982098c4797530d3ac2265a6c79'
[2024-10-04T11:50:38.594292Z] [info] [pid:10594] Running cmd: git -C /var/lib/openqa/share/tests/opensuse remote get-url origin
[2024-10-04T11:50:38.629396Z] [debug] [pid:10594] https://github.com/os-autoinst/os-autoinst-distri-opensuse.git

[2024-10-04T11:50:38.629508Z] [debug] [pid:10594] 
[2024-10-04T11:50:38.629560Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:38.630894Z] [info] [pid:10594] Running cmd: git -C /var/lib/openqa/share/tests/opensuse diff-index HEAD --exit-code
[2024-10-04T11:50:38.694097Z] [debug] [pid:10594] 
[2024-10-04T11:50:38.694321Z] [debug] [pid:10594] 
[2024-10-04T11:50:38.694385Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:38.694670Z] [info] [pid:10594] Running cmd: git -C /var/lib/openqa/share/tests/opensuse branch --show-current
[2024-10-04T11:50:38.719500Z] [debug] [pid:10594] master

[2024-10-04T11:50:38.719602Z] [debug] [pid:10594] 
[2024-10-04T11:50:38.719647Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:38.719891Z] [info] [pid:10594] Running cmd: env GIT_SSH_COMMAND=ssh -oBatchMode=yes git -C /var/lib/openqa/share/tests/opensuse fetch origin a987264cb87a9982098c4797530d3ac2265a6c79
[2024-10-04T11:50:39.393781Z] [debug] [pid:10594] 
[2024-10-04T11:50:39.393989Z] [debug] [pid:10594] From https://github.com/os-autoinst/os-autoinst-distri-opensuse
 * branch                a987264cb87a9982098c4797530d3ac2265a6c79 -> FETCH_HEAD
[2024-10-04T11:50:39.394066Z] [info] [pid:10594] cmd returned 0
[2024-10-04T11:50:42.752460Z] [debug] Process 11210 is performing job "4386796" with task "git_clone"
[2024-10-04T11:50:42.795974Z] [debug] [#4386796] Updating '/var/lib/openqa/share/tests/opensuse' to 'https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#a987264cb87a9982098c4797530d3ac2265a6c79'
[2024-10-04T11:50:42.797282Z] [info] [pid:11210] Running cmd: git -C /var/lib/openqa/share/tests/opensuse remote get-url origin
[2024-10-04T11:50:42.846267Z] [debug] [pid:11210] https://github.com/os-autoinst/os-autoinst-distri-opensuse.git

[2024-10-04T11:50:42.846395Z] [debug] [pid:11210] 
[2024-10-04T11:50:42.846447Z] [info] [pid:11210] cmd returned 0
[2024-10-04T11:50:42.855957Z] [info] [pid:11210] Running cmd: git -C /var/lib/openqa/share/tests/opensuse diff-index HEAD --exit-code
[2024-10-04T11:50:42.914095Z] [debug] [pid:11210] 
[2024-10-04T11:50:42.914186Z] [debug] [pid:11210] 
[2024-10-04T11:50:42.914233Z] [info] [pid:11210] cmd returned 0
[2024-10-04T11:50:42.915033Z] [info] [pid:11210] Running cmd: git -C /var/lib/openqa/share/tests/opensuse branch --show-current
[2024-10-04T11:50:42.937050Z] [debug] [pid:11210] master

[2024-10-04T11:50:42.937144Z] [debug] [pid:11210] 
[2024-10-04T11:50:42.937205Z] [info] [pid:11210] cmd returned 0
[2024-10-04T11:50:42.937366Z] [info] [pid:11210] Running cmd: env GIT_SSH_COMMAND=ssh -oBatchMode=yes git -C /var/lib/openqa/share/tests/opensuse fetch origin a987264cb87a9982098c4797530d3ac2265a6c79
# <----------------- note: no "cmd returned ..." here
[2024-10-04T11:50:48.782078Z] [info] Worker 30673 stopped

Maybe that should be handled in a better way?

Actions #2

Updated by tinita 2 months ago

  • Description updated (diff)

We already use a SignalGuard elsewhere: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Job/FinalizeResults.pm#L15

    my $ensure_task_retry_on_termination_signal_guard = OpenQA::Task::SignalGuard->new($minion_job);

Should be easy to add that to the git_clone task as well.
We just need to be sure that it's ok to abort the series of git commands at any time without leaving it in a broken state, but I think that's the case here.

Actions #3

Updated by tinita 2 months ago

  • Related to action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M added
Actions #4

Updated by tinita 2 months ago

  • Related to action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download added
Actions #5

Updated by tinita 2 months ago

  • Related to action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule added
Actions #6

Updated by tinita 2 months ago

  • Parent task set to #99831
Actions #7

Updated by tinita 2 months ago

  • Related to action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3 added
Actions #8

Updated by tinita 2 months ago

  • Description updated (diff)
Actions #9

Updated by tinita 2 months ago

  • Description updated (diff)
Actions #10

Updated by tinita 2 months ago

  • Related to deleted (action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3)
Actions #11

Updated by tinita 2 months ago

  • Has duplicate action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3 added
Actions #12

Updated by okurz 2 months ago

  • Priority changed from Normal to High
Actions #13

Updated by okurz 2 months ago

  • Subject changed from scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" to scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #14

Updated by mkittler 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #15

Updated by mkittler 2 months ago · Edited

I tried to reproduce this locally but couldn't. If I restart the Minion service while a download job is active (or explicitly send SIGTERM to the active job via the Minion dashboard) it goes into the inactive state. It is retried and even if I repeat this multiple times it can eventually finish and the openQA job can run. Looking at the code it also looks like the signal guard we want to use is already in place. That explains why it generally works.

This job we saw in production was maybe really caused by the signal arriving right before the signal guard was registered. Not sure how to fix that; we'd probably needed an upstream Minion change for that. This change would allow us to add a SIGTERM handler before the actual job runs in some pre-run callback. If SIGTERM is received while the pre-run callback is executed the Minion framework would just shutdown and not execute the job (so the job never leaves the inactive state). Maybe using the start event for the signal guard registration would actually do the trick?

Actions #16

Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback

I asked @kraih on the chat.

Actions #17

Updated by mkittler 2 months ago · Edited

As mentioned by @tinita on the chat this was actually about a cache_tests task where we don't use the signal guard. So just adding one will probably be good enough: https://github.com/os-autoinst/openQA/pull/5999

Actions #18

Updated by mkittler 2 months ago

  • Status changed from Feedback to Resolved

The PR has been merged.

Actions

Also available in: Atom PDF