action #99741

Updated by okurz about 1 year ago

## Observation

Gru logs show entries like this for minion jobs:

Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/ line 63.

The affected minion jobs are nevertheless shown as **finished** rather than **failed**, even though `hook_rc` is non-zero, e.g. with the following details:

args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
  gru_id: 17752756
  hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
  hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
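
The mismatch between `state: finished` and `hook_rc: -1` can be checked mechanically. The following is a minimal sketch, assuming the job details are available as a dictionary and that `hook_rc` is stored in the job's notes (as the query under Suggestions implies). The function name `hook_failed_silently` is hypothetical and not part of openQA or Minion.

```python
def hook_failed_silently(job: dict) -> bool:
    """Return True if a job is marked finished although its hook script failed."""
    notes = job.get("notes") or {}
    hook_rc = notes.get("hook_rc")
    # Jobs that ran no hook script carry no hook_rc and are not counted.
    if hook_rc is None:
        return False
    return job.get("state") == "finished" and int(hook_rc) != 0


# Values taken from the job details above; the dictionary layout is an assumption.
job = {
    "id": 800152,
    "state": "finished",
    "task": "finalize_job_results",
    "notes": {"gru_id": 17752756, "hook_rc": -1},
}
print(hook_failed_silently(job))  # True
```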

## Acceptance criteria
* **AC1:** Alerts are received if a high number (or ratio) of hook scripts fail

## Suggestions
* Read lib/OpenQA/Task/Job/
* Put a postgres query like `select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour' order by finished;` into
* Add a corresponding Grafana panel with an alert and a description
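
To illustrate the "ratio" variant of AC1, here is a minimal sketch of how an alert condition could be derived from the `hook_rc` values that such a query returns. The function names and the 10% threshold are assumptions chosen for illustration, not values from this ticket. In practice the condition would more likely be expressed in the monitoring query or in the Grafana alert rule itself.

```python
from typing import Iterable, Optional


def hook_failure_ratio(hook_rcs: Iterable[Optional[int]]) -> float:
    """Fraction of hook scripts that exited with a non-zero return code.

    None entries (jobs that ran no hook script) are ignored.
    """
    rcs = [rc for rc in hook_rcs if rc is not None]
    if not rcs:
        return 0.0
    failed = sum(1 for rc in rcs if rc != 0)
    return failed / len(rcs)


def should_alert(hook_rcs: Iterable[Optional[int]], max_ratio: float = 0.1) -> bool:
    # max_ratio is an illustrative threshold, not a value from the ticket.
    return hook_failure_ratio(hook_rcs) > max_ratio


# Hypothetical sample: five jobs with hooks (two failed), one job without a hook.
sample = [0, 0, -1, None, 0, 1]
print(hook_failure_ratio(sample))  # 0.4
print(should_alert(sample))        # True
```

A ratio is less sensitive to overall job volume than an absolute count, but on quiet days a single failure can push it over the threshold. Combining it with a minimum number of hook runs may be worth considering.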