action #107533
coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M
Description
Acceptance criteria
- AC1: The Minion job "finalize_job_results" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "finalize_job_results" over multiple deployments (or service restarts)
Suggestions
- Implement a SIGTERM handler for "finalize_job_results", similar to what we did, for example, in https://github.com/os-autoinst/openQA/pull/4415/files
- Test on o3 and OSD, either by manually restarting openqa-gru multiple times or by waiting for the results of multiple deployments, and check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed, specifically https://openqa.suse.de/minion/jobs?state=finished&offset=0&task=finalize_job_results
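
The general idea behind AC1 can be sketched as follows. This is only an illustration in Python of the "set a flag on SIGTERM, stop at the next safe point" pattern, not the openQA (Perl) implementation and not necessarily what the referenced pull request does. The function name `run_with_sigterm_handler` and the `"retry"`/`"finished"` return values are hypothetical.

```python
import signal


def run_with_sigterm_handler(items, process):
    """Process items one by one, stopping cleanly if SIGTERM arrives.

    Hypothetical helper: returns ("finished", done) when all items were
    processed, or ("retry", done) when a SIGTERM was received and the
    remaining items were left for a later run.
    """
    stop = {"requested": False}

    def on_sigterm(signum, frame):
        # Only record the request; the loop decides when it is safe to stop.
        stop["requested"] = True

    # Install the handler for the duration of the job only.
    previous = signal.signal(signal.SIGTERM, on_sigterm)
    done = []
    try:
        for item in items:
            if stop["requested"]:
                # Stop between items so no item is left half-processed.
                return "retry", done
            process(item)
            done.append(item)
        return "finished", done
    finally:
        # Restore whatever handler was active before the job started.
        signal.signal(signal.SIGTERM, previous)
```

With a handler like this, a service restart lets the job finish its current unit of work and hand back a defined state instead of being killed mid-operation, which is the kind of abrupt end reported as "Job terminated unexpectedly".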