action #99741
Updated by okurz over 1 year ago
Minion jobs for job hooks failed silently on o3

## Observation

Gru logs show entries like this for minion jobs:

Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.

Relevant minion jobs are shown as **finished** rather than "failed", e.g. https://openqa.opensuse.org/minion/jobs?id=800152 with the following details:

```yaml
---
args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
  gru_id: 17752756
  hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
```

## Acceptance criteria

* **AC1:** Alerts are received for both osd+o3 if a high (configurable?) amount (or ratio) of hook scripts fail

## Suggestions

* Read [lib/OpenQA/Task/Job/FinalizeResults.pm](https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Job/FinalizeResults.pm#L63)
* Put a postgres query like `select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour' order by finished;` into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L110
* Don't query too frequently (once an hour maybe), since any query will have to be rather inefficient
* Consider a solution for o3 as well, e.g. error in log that openqa_logwarn would alert us about
* Optional: Add an according grafana panel with alert and description
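
As a rough illustration of the "configurable amount or ratio" idea in AC1, the decision could look something like the sketch below. It is not part of openQA, telegraf or grafana: the function names, the default threshold and the minimum sample size are all hypothetical. It also assumes the input contains the `hook_rc` of every hook run in the time window, whereas the query suggested above returns only the failures, so that query would need to drop its `!= 0` filter (or be paired with a total count) to feed a ratio.

```python
from typing import Iterable, List, Optional


def recorded_return_codes(hook_rcs: Iterable[Optional[int]]) -> List[int]:
    """Drop entries without a recorded hook_rc (jobs that ran no hook)."""
    return [rc for rc in hook_rcs if rc is not None]


def hook_failure_ratio(hook_rcs: Iterable[Optional[int]]) -> float:
    """Fraction of hook runs whose return code is non-zero."""
    recorded = recorded_return_codes(hook_rcs)
    if not recorded:
        return 0.0
    failed = sum(1 for rc in recorded if rc != 0)
    return failed / len(recorded)


def should_alert(
    hook_rcs: Iterable[Optional[int]],
    threshold: float = 0.5,  # hypothetical default, meant to be configurable
    min_samples: int = 10,   # hypothetical guard against tiny sample sizes
) -> bool:
    """Alert only when enough hooks ran and the failure ratio is high."""
    recorded = recorded_return_codes(hook_rcs)
    if len(recorded) < min_samples:
        return False
    return hook_failure_ratio(recorded) >= threshold
```

A `hook_rc` of `-1`, as in the job details above, counts as a failure here just like any other non-zero value. In practice the same comparison would more likely live in a grafana alert rule than in separate code.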