coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
- AC1: The Minion job "limit_results_and_logs" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: The Minion job lists on OSD and O3 show no "Job terminated unexpectedly" failures for "limit_results_and_logs" over multiple deployments
- Implement a SIGTERM handler for "limit_results_and_logs" (a sketch follows this list)
- Test on o3 and osd, either by manually restarting openqa-gru multiple times or by waiting for the results of multiple deployments, and check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
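A minimal sketch of such a handler, assuming a standalone Minion setup; the task name matches this ticket, while the backend, note key, batch loop, and retry delay are illustrative stand-ins for openQA's actual cleanup logic:

```perl
use Mojo::Base -strict, -signatures;
use Minion;

# SQLite backend chosen only to keep the sketch self-contained
# (requires Minion::Backend::SQLite); openQA itself uses PostgreSQL.
my $minion = Minion->new(SQLite => 'sqlite:minion.db');

$minion->add_task(
    limit_results_and_logs => sub ($job) {
        my $stopping = 0;

        # On SIGTERM only set a flag; the loop below checks it at points
        # where stopping is safe, so the process never dies mid-cleanup.
        local $SIG{TERM} = sub { $stopping = 1 };

        for my $batch (1 .. 100) {
            if ($stopping) {
                # Record why the job stopped and re-queue it instead of
                # failing with "Job terminated unexpectedly".
                $job->note(stopped_gracefully => 1);
                return $job->retry({delay => 60});
            }
            # ... delete one batch of old results/logs here ...
        }

        $job->finish('cleanup done');
    });
```

The key design choice is that the handler itself does no real work: it merely sets a flag, and the cleanup loop decides when it is safe to stop, note, and retry.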
#1 Updated by cdywan about 2 months ago
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" to Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
- Status changed from New to Workable
#6 Updated by mkittler about 1 month ago
https://openqa.suse.de/minion/jobs?task=limit_results_and_logs looks good so far. I've also checked on o3, where it looks good as well. (I have since cleaned up the Minion dashboard on o3 as there were over 400 failed jobs.)
#7 Updated by mkittler about 1 month ago
Looks like it works; there's already a retried Minion job (with the corresponding note) on OSD: https://openqa.suse.de/minion/jobs?id=3588093
Applying the same approach to other cleanup jobs is easy, so I've created a PR (not strictly part of this ticket): https://github.com/os-autoinst/openQA/pull/4396
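For reference, a hedged sketch of how the guard could be factored out and shared across cleanup tasks; the helper name and the wrapped task functions are hypothetical, and the actual PR may structure this differently:

```perl
use Mojo::Base -strict, -signatures;

# Hypothetical wrapper: installs the SIGTERM flag and hands the task a
# closure it can poll at safe stopping points.
sub gracefully ($task) {
    return sub ($job, @args) {
        my $stopping = 0;
        local $SIG{TERM} = sub { $stopping = 1 };
        return $task->($job, sub {$stopping}, @args);
    };
}

# Several cleanup tasks could then reuse the same guard, e.g.:
#   $minion->add_task(limit_results_and_logs => gracefully(\&_limit_results_and_logs));
#   $minion->add_task(limit_assets           => gracefully(\&_limit_assets));
```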