action #108980
coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download size:S
Description
Acceptance criteria
- AC1: The Minion job "download_asset" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "download_asset" over multiple deployments
Suggestions
- Implement a SIGTERM handler for "download_asset", as has already been done for other jobs (e.g. #103416)
- Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
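The handler suggested above could follow a pattern like the one below. This is a minimal shell sketch of the general idea only, not the openQA implementation (which is written in Perl): the `download_in_chunks` function, the temporary file, and the exit status 75 are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a long-running download task; not openQA code.
target=$(mktemp)   # stands in for the partially downloaded asset
terminated=0

# On SIGTERM, only record the request; the work loop decides when to stop.
trap 'terminated=1' TERM

download_in_chunks() {
    i=0
    while [ "$i" -lt 5 ]; do
        if [ "$terminated" -eq 1 ]; then
            # Shut down cleanly: drop the partial file and report a
            # temporary failure (75 is an arbitrary choice for this sketch).
            rm -f "$target"
            echo "interrupted: cleaned up, task should be retried"
            return 75
        fi
        echo "chunk $i" >> "$target"
        i=$((i + 1))
    done
    echo "done"
}

# Uninterrupted run: all five chunks are written to the temporary file.
download_in_chunks
```

The point of the pattern is that the signal handler itself does very little; the task checks the flag at safe points, so it can leave no half-written asset behind instead of being killed mid-write and reported as "Job terminated unexpectedly".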
Updated by tinita 6 months ago
- Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added
Updated by tinita 5 months ago
- Copied to action #169144: Link to minion job from openQA job with waiting task size:S added
Updated by tinita 5 days ago
- Copied to action #179885: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Git::Clone git_clone size:S added
Updated by ybonatakis 4 days ago
- Status changed from Workable to In Progress
Updated by ybonatakis 4 days ago
- Status changed from In Progress to Feedback
I tested it with the following steps:
I went to https://openqa.opensuse.org/minion/jobs and picked the first download_asset task, which was https://openqa.opensuse.org/minion/jobs?id=5087207
On o3, I removed obs-server.x86_64-2.10.51-qcow2-Build40.10.qcow2, the asset from jobs?id=5087207.
Back in the Minion UI, I restarted the job while running for i in {1..10}; do echo $i; systemctl restart openqa-gru; sleep 30; done
This produced a failed job with the error "Job terminated unexpectedly", which I could see at https://openqa.opensuse.org/minion/jobs?task=download_asset&state=failed&queue=&note= or https://openqa.opensuse.org/minion/jobs?id=5087207
Subsequently, I patched /usr/share/openqa/lib/OpenQA/Task/Asset/Download.pm with the changes from b231c356074ed5498a473a15006e1690c9638e4c
and repeated the steps: remove the qcow2 file, restart the Minion job, and loop over restarts of the gru service. The job succeeded (checked again in the UI) and the qcow2 file was present on the server.
I did this a few times, also varying the sleep value.
I didn't find any matching logs in the openqa-gru service, though; I was looking for "Downloading....".
Anyway, I think it was successful.
Note: I first looked for an existing test that could help, but none can be used in this case. I wish there were automation to trigger this test.
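The manual restart loop used above could be wrapped in a small script so that the count and sleep value are easy to vary. This is only a sketch: the `restart_loop` function and the `DRY_RUN` variable are hypothetical names, and the dry-run default is an assumption chosen so that nothing is restarted by accident.

```shell
#!/bin/sh
# restart_loop SERVICE COUNT DELAY
# Restarts SERVICE COUNT times, sleeping DELAY seconds after each restart.
# DRY_RUN (hypothetical switch) defaults to 1, which only prints the steps.
restart_loop() {
    service=$1
    count=$2
    delay=$3
    i=1
    while [ "$i" -le "$count" ]; do
        echo "restart $i/$count of $service"
        if [ "${DRY_RUN:-1}" -eq 0 ]; then
            systemctl restart "$service"
        fi
        sleep "$delay"
        i=$((i + 1))
    done
}

# Dry run with no delay: prints the planned restarts without touching systemd.
restart_loop openqa-gru 2 0
```

Setting `DRY_RUN=0` and calling `restart_loop openqa-gru 10 30` would correspond to the loop from the steps above; it needs sufficient privileges for `systemctl` and is best run against a local instance.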
Updated by tinita 4 days ago
Did you do this on o3?
Note that repeatedly restarting the gru service can result in more of those failed jobs, so
https://openqa.opensuse.org/minion/jobs?id=5087436
https://openqa.opensuse.org/minion/jobs?id=5087396
could possibly come from your restarts?
If possible, try out such things on your local instance instead.
Updated by ybonatakis 3 days ago
tinita wrote in #note-12:
Did you do this on o3?
Note that repeatedly restarting the gru service can result in more of those failed jobs, so
https://openqa.opensuse.org/minion/jobs?id=5087436
https://openqa.opensuse.org/minion/jobs?id=5087396
could possibly come from your restarts?
If possible, try out such things on your local instance instead.
I don't know whether both of them came from my experiments, but it is possible. I admit that I did not think of my local instance; in the end it would have been easier to test it there.
Updated by ybonatakis 3 days ago
- Status changed from Feedback to Resolved
Merged, and everything is green in the Minion dashboard.