action #107533
coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M
Description
Acceptance criteria
- AC1: The minion job "finalize_job_results" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: Our minion job list on OSD and O3 does not show any "Job terminated unexpectedly" failures for "finalize_job_results" over multiple deployments (or service restarts)
Suggestions
- Implement a SIGTERM handler for "finalize_job_results", similar to what we did, for example, in https://github.com/os-autoinst/openQA/pull/4415/files
- Test on o3 and osd, either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments, and check the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed, specifically https://openqa.suse.de/minion/jobs?state=finished&offset=0&task=finalize_job_results
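The handler idea from the suggestions can be sketched roughly as follows. This is only an illustration in Python, not the openQA/Minion code (which is Perl) and not the content of the linked pull request. The names `StopFlag` and `finalize_results` and the "retry" return state are hypothetical. The sketch assumes the task can be split into units of work with safe stopping points in between.

```python
import signal


class StopFlag:
    """Records that a termination signal arrived; the task polls it."""

    def __init__(self):
        self.requested = False

    def request_stop(self, signum, frame):
        # Keep the handler tiny: only record the request, do no real work here.
        self.requested = True

    def install(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)


def finalize_results(modules, stop):
    """Hypothetical stand-in for the task: handle one module at a time.

    Returns (state, processed), where state is "finished" or "retry".
    """
    processed = []
    for module in modules:
        if stop.requested:
            # Stop at a safe point so the job can be re-enqueued
            # instead of being killed in the middle of a unit of work.
            return "retry", processed
        processed.append(module.upper())  # placeholder for the real work
    return "finished", processed
```

A worker process would call `StopFlag().install()` once at startup and pass the flag to the task. Checking the flag only between units keeps each unit atomic, at the cost of reacting to the signal only after the current unit has completed.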
Updated by okurz almost 3 years ago
- Copied from action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M added
Updated by mkittler almost 3 years ago
Maybe the one archiving job I found today was killed because it took too long to handle SIGTERM: it was already at the point where everything had been copied. At this point we continue regardless of the signal because the job only needs to update one database row and delete the old results directory, neither of which should take very long. I implemented it this way to avoid leftover result directories for archived jobs, so I wouldn't change it to abort here. Maybe we can extend the timeout before the jobs are killed.
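To illustrate the behaviour described above, here is a minimal sketch in Python (not the actual Perl implementation). The function `archive_job` and its parameters are hypothetical. It assumes the work splits into an abortable copy phase and a short commit phase (database update, removal of the old directory) that always runs to completion once copying is done.

```python
def archive_job(copy_steps, commit_steps, stop_requested):
    """Run copy steps abortably, then commit steps unconditionally.

    copy_steps, commit_steps: lists of zero-argument callables.
    stop_requested: zero-argument callable returning True once a
    termination signal (e.g. SIGTERM) has been seen.
    """
    for step in copy_steps:
        if stop_requested():
            # Nothing irreversible has happened yet, so aborting is safe.
            return "aborted"
        step()

    # Point of no return: everything has been copied. From here on the
    # signal is deliberately ignored so that no leftover result
    # directory remains for an archived job.
    for step in commit_steps:
        step()
    return "archived"
```

The trade-off is the one mentioned in the comment: if the commit phase takes longer than the supervisor's kill timeout, the job is still terminated, which is why extending that timeout is an option.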
Updated by mkittler almost 3 years ago
- Status changed from New to In Progress
Updated by openqa_review almost 3 years ago
- Due date set to 2022-03-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
Updated by mkittler over 2 years ago
The change was deployed today (11:12 CET). It looks good so far: the last jobs on https://openqa.suse.de/minion/jobs?state=failed&task=finalize_job_results are from 4 days ago.
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
It still looks good, with no further occurrences, so I'm resolving the issue.