action #99741
closed
Minion jobs for job hooks failed silently on o3 size:M
Start date: 2021-10-04
Due date:
% Done: 0%
Estimated time:
Description
Minion jobs for job hooks failed silently on o3
Observation
Gru logs show entries like this for minion jobs:
Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.
Relevant minion jobs are shown as finished rather than "failed", e.g. https://openqa.opensuse.org/minion/jobs?id=800152 with the following details:
---
args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
  gru_id: 17752756
  hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
    /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
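The mismatch in these job details (state "finished", yet a non-zero hook_rc in the notes) can be checked mechanically. The following is only a minimal sketch: the function name is_silent_hook_failure is hypothetical, and the job is assumed to be available as a plain dictionary with the same keys as the dump above.

```python
def is_silent_hook_failure(job: dict) -> bool:
    """Return True if a Minion job is marked finished although its hook failed.

    Hypothetical helper: `job` is assumed to mirror the job details above.
    """
    notes = job.get("notes") or {}
    hook_rc = notes.get("hook_rc")
    # Jobs that ran no hook carry no hook_rc note and are not counted.
    if hook_rc is None:
        return False
    return job.get("state") == "finished" and int(hook_rc) != 0


# Trimmed-down copy of the job details shown above
job = {
    "id": 800152,
    "state": "finished",
    "task": "finalize_job_results",
    "notes": {"gru_id": 17752756, "hook_rc": -1},
}
print(is_silent_hook_failure(job))
```

For job 800152 this flags the job, since hook_rc is -1 while the state is "finished".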
Acceptance criteria
- AC1: Alerts are received for both osd and o3 if a high (configurable?) number (or ratio) of hook scripts fail
Suggestions
- Read lib/OpenQA/Task/Job/FinalizeResults.pm
- Put a postgres query like
  select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour' order by finished;
  into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L110
- Don't query too frequently (once an hour maybe), since any query will have to be rather inefficient
- Consider a solution for o3 as well, e.g. an error in the log that openqa_logwarn would alert us about
- Optional: Add a corresponding Grafana panel with alert and description
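To make the "amount (or ratio)" idea from AC1 concrete, the alert decision could be sketched as below. This is only an illustration under assumptions: the function name hook_failure_stats and the default threshold of 0.1 are hypothetical, and the input is assumed to be the list of hook_rc values of all hook runs in the monitored time window (not only the failing ones that the query above returns).

```python
def hook_failure_stats(hook_rcs, max_ratio=0.1):
    """Summarize hook return codes and decide whether to alert.

    hook_rcs:  return codes of all hook runs in the time window (assumed input)
    max_ratio: configurable alert threshold (0.1 is a hypothetical default)
    """
    total = len(hook_rcs)
    failed = sum(1 for rc in hook_rcs if rc != 0)
    # Avoid a division by zero when no hooks ran in the window.
    ratio = failed / total if total else 0.0
    return {
        "total": total,
        "failed": failed,
        "ratio": ratio,
        "alert": ratio > max_ratio,
    }


# One failing hook (rc -1, as in the job above) out of four runs
stats = hook_failure_stats([0, 0, -1, 0])
print(stats)
```

Using a ratio rather than an absolute count keeps the threshold meaningful on instances with very different job volumes, which matters if the same check is meant to cover both osd and o3.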