action #179885
closed coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Git::Clone git_clone
Description
Acceptance criteria
- AC1: The minion job "git_clone" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: Our minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "git_clone" over multiple deployments
Suggestions
- Implement a SIGTERM handler for "git_clone" like it has already been done for other jobs (e.g. #103416)
- Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
- (Note: I don't know who suggested this, but restarting openqa-gru multiple times in production can itself trigger the actual problem and leave us with incomplete jobs. So try it on a local instance first, or at least check and repair things afterwards if necessary.)
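For illustration only, here is a minimal sketch of the general pattern AC1 asks for: a guard that is active while a task runs and turns a termination signal into a clean "retry" outcome instead of an abrupt death. It is written in Python, not in openQA's Perl, and it is not the change that was merged. The names `SignalGuard`, `RetryRequested` and `run_task` are hypothetical and chosen just for this example.

```python
import os
import signal


class RetryRequested(Exception):
    """Raised inside the running task when a termination signal arrives."""


class SignalGuard:
    """Hypothetical helper: while active, SIGTERM/SIGINT request a retry."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGINT)):
        self._signals = signals
        self._previous = {}

    def _handle(self, signum, frame):
        # Interrupt the task with an exception instead of letting the
        # default action kill the process mid-work.
        raise RetryRequested(signal.Signals(signum).name)

    def __enter__(self):
        for sig in self._signals:
            # signal.signal() returns the handler that was installed before.
            self._previous[sig] = signal.signal(sig, self._handle)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the previous handlers whether or not the task finished.
        for sig, handler in self._previous.items():
            signal.signal(sig, handler)
        return False  # do not swallow exceptions


def run_task(task):
    """Run task() under the guard and report 'finished' or 'retry'."""
    try:
        with SignalGuard():
            task()
        return "finished"
    except RetryRequested:
        # A real job queue would re-enqueue the job here (assumption).
        return "retry"


if __name__ == "__main__":
    # Simulate a service restart by sending SIGTERM to ourselves mid-task.
    print(run_task(lambda: os.kill(os.getpid(), signal.SIGTERM)))
```

The point of the design is that the interrupted job ends in a well-defined state that the queue can act on, so it would not show up as "Job terminated unexpectedly". Signal handlers in Python can only be installed from the main thread, so this sketch assumes the task runs there.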
Updated by tinita about 2 months ago
- Copied from action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download size:S added
Updated by tinita about 2 months ago
- Related to action #179873: Long delays due to "git_clone" tasks added
Updated by okurz about 2 months ago
- Target version changed from Tools - Next to Ready
Updated by okurz about 2 months ago
- Target version changed from Ready to Tools - Next
Updated by tinita about 2 months ago
- Target version changed from Tools - Next to Ready
Updated by tinita about 2 months ago
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Git::Clone git_clone size:S to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Git::Clone git_clone
- Status changed from Workable to New
Updated by okurz about 2 months ago
- Target version changed from Ready to Tools - Next
Updated by tinita 13 days ago
- Related to action #182225: o3 jobs stuck on "git_clone" for days added
Updated by livdywan 12 days ago
mkittler wrote in #note-14:
AC2 mentions "multiple deployments" but we don't usually want to keep tickets in Feedback for a long time just to wait on long-term results. Meaning this should only be re-opened if that isn't the case.
@mkittler Did you want to do additional manual testing? Otherwise I would suggest we resolve this as we usually do.
Updated by mkittler 11 days ago
- Status changed from Feedback to Resolved
No, I didn't test it, as the change is trivial: we just add one line of code here (that enables functionality which is already tested). I guess we can indeed resolve this. If we see an alert due to jobs failing like this again, we can reopen the ticket.