action #108980

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download size:S

Added by mkittler about 3 years ago. Updated 3 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-03-25
Due date:
% Done:

0%

Estimated time:

Description

Acceptance criteria

  • AC1: The Minion job "download_asset" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
  • AC2: The Minion job lists on OSD and o3 do not show any "Job terminated unexpectedly" failures for "download_asset" over multiple deployments

Suggestions

  • Implement sigterm handler for "download_asset" like it has already been done for other jobs (e.g. #103416)
  • Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
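
To illustrate what AC1 asks for, here is a minimal sketch of the general pattern: record SIGTERM instead of dying mid-write, clean up the partial file, and exit with a distinct status. This is not openQA code. The real handler lives in Perl in OpenQA::Task::Asset::Download, and what it does on SIGTERM is not described in this ticket. The function name, the ".part" suffix, the chunk loop, and the exit status 75 are all assumptions made for illustration.

```shell
# Hypothetical stand-in for a long-running download; not openQA code.
# Usage: download_with_sigterm_handler TARGET [CHUNKS]
download_with_sigterm_handler() {
    dl_target=$1
    dl_chunks=${2:-5}
    dl_partial="$dl_target.part"
    dl_terminated=0
    # Record the signal instead of letting it kill the shell mid-write.
    trap 'dl_terminated=1' TERM
    : > "$dl_partial"
    dl_i=1
    while [ "$dl_i" -le "$dl_chunks" ]; do
        if [ "$dl_terminated" -eq 1 ]; then
            # Clean shutdown: drop the partial file and report a
            # distinct status (75 is an arbitrary choice here).
            rm -f "$dl_partial"
            trap - TERM
            echo "interrupted"
            return 75
        fi
        echo "chunk $dl_i" >> "$dl_partial"   # simulated unit of work
        sleep 0.1
        dl_i=$((dl_i + 1))
    done
    trap - TERM
    mv "$dl_partial" "$dl_target"
    echo "done"
}
```

A caller can tell a completed download from an interrupted one by the exit status, which is the property that lets an interrupted job be handled deliberately rather than reported as "Job terminated unexpectedly".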

Related issues 3 (2 open, 1 closed)

Related to openQA Project (public) - action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M (Resolved, mkittler, 2024-10-04)
Copied to openQA Project (public) - action #169144: Link to minion job from openQA job with waiting task size:S (Workable)
Copied to openQA Project (public) - action #179885: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Git::Clone git_clone size:S (New, 2025-04-02)

Actions #1

Updated by mkittler about 3 years ago

  • Category set to Feature requests
Actions #2

Updated by okurz about 3 years ago

  • Target version set to future
Actions #3

Updated by tinita 6 months ago

  • Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added
Actions #4

Updated by tinita 5 months ago

  • Copied to action #169144: Link to minion job from openQA job with waiting task size:S added
Actions #5

Updated by okurz 19 days ago

  • Target version changed from future to Tools - Next
Actions #6

Updated by okurz 18 days ago

  • Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download size:S
  • Status changed from New to Workable
Actions #7

Updated by tinita 5 days ago

  • Copied to action #179885: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Git::Clone git_clone size:S added
Actions #8

Updated by okurz 5 days ago

  • Target version changed from Tools - Next to Ready
Actions #9

Updated by ybonatakis 4 days ago

  • Assignee set to ybonatakis
Actions #10

Updated by ybonatakis 4 days ago

  • Status changed from Workable to In Progress
Actions #11

Updated by ybonatakis 4 days ago

  • Status changed from In Progress to Feedback

I tested it with the following steps:
I went to https://openqa.opensuse.org/minion/jobs and picked the first download_asset task, which was https://openqa.opensuse.org/minion/jobs?id=5087207

I removed obs-server.x86_64-2.10.51-qcow2-Build40.10.qcow2, the asset from jobs?id=5087207, on o3.
Then, back in the Minion UI, I restarted the job while running for i in {1..10}; do echo $i; systemctl restart openqa-gru; sleep 30; done

This produced a failed job, which I could see at https://openqa.opensuse.org/minion/jobs?task=download_asset&state=failed&queue=&note=
or https://openqa.opensuse.org/minion/jobs?id=5087207, with the error "Job terminated unexpectedly".

Subsequently, I patched /usr/share/openqa/lib/OpenQA/Task/Asset/Download.pm with the changes from b231c356074ed5498a473a15006e1690c9638e4c
and repeated the steps: remove the qcow2, restart the Minion job, and run the loop that restarts the gru service. The job succeeded (checked again in the UI) and I found the qcow2 on the server.

I did this a few times, also varying the sleep value.
I didn't find any logs in the openqa-gru service, though. I was looking for "Downloading....".

Anyway, I think it was successful.

Note: I first tried to find an existing test that could help, but none can be used in this case. I wish there were automation to trigger the test.
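
The restart loop from the steps above could be wrapped in a small helper so the count and pause are easy to vary. This is only a sketch: the function name restart_loop and the RESTART_CMD override are hypothetical, added so the loop can be dry-run without touching a real service.

```shell
# Restart the gru service repeatedly to provoke interrupted Minion jobs.
# Usage: restart_loop [COUNT] [PAUSE_SECONDS]
# RESTART_CMD is a hypothetical override for dry runs; by default the
# function runs the same command as the one-liner in the comment above.
restart_loop() {
    rl_count=${1:-10}
    rl_pause=${2:-30}
    rl_i=1
    while [ "$rl_i" -le "$rl_count" ]; do
        echo "$rl_i"
        # Unquoted on purpose so the command string splits into words.
        ${RESTART_CMD:-systemctl restart openqa-gru} || return 1
        sleep "$rl_pause"
        rl_i=$((rl_i + 1))
    done
}
```

For example, `restart_loop 10 30` reproduces the original loop, and `RESTART_CMD=true restart_loop 3 0` only prints the counter. Running this against a local instance rather than o3 avoids interrupting unrelated jobs.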

Actions #12

Updated by tinita 4 days ago

Did you do this on o3?

Note that repeatedly restarting the gru service can result in more of those failed jobs, so
https://openqa.opensuse.org/minion/jobs?id=5087436
https://openqa.opensuse.org/minion/jobs?id=5087396
could possibly come from your restarts?

If possible, try out such things on your local instance instead.

Actions #13

Updated by ybonatakis 3 days ago

tinita wrote in #note-12:

Did you do this on o3?

Note that repeatedly restarting the gru service can result in more of those failed jobs, so
https://openqa.opensuse.org/minion/jobs?id=5087436
https://openqa.opensuse.org/minion/jobs?id=5087396
could possibly come from your restarts?

If possible, try out such things on your local instance instead.

I don't know whether both of them come from my experiments, but it is possible. I admit that I ignored my local instance; in the end it would have been easier to test it there.

Actions #14

Updated by ybonatakis 3 days ago

  • Status changed from Feedback to Resolved

Merged, and everything is green in the Minion dashboard.
