action #99741
Minion jobs for job hooks failed silently on o3
Description
Observation
Gru logs show entries like this for minion jobs:
Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.
Relevant minion jobs are shown as finished rather than "failed", e.g. https://openqa.opensuse.org/minion/jobs?id=800152 with the following details:
---
args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
gru_id: 17752756
hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
/opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
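The failure stays silent because the hook's exit code only ends up in the Minion job's notes (hook_rc above) while the Minion job itself is never marked as failed. A minimal sketch of that pattern, with illustrative names, not the actual FinalizeResults.pm code:
# Sketch only, assuming the hook is run roughly like this in the finalize task;
# sub and variable names are illustrative, not copied from FinalizeResults.pm.
sub _run_hook {
    my ($minion_job, $openqa_job_id, $hook_cmd) = @_;
    # system() with a single string goes through /bin/sh; if the shell cannot be
    # executed (the "Can't exec /bin/sh: Permission denied" case above) it only
    # warns and returns -1
    my $rc = system("$hook_cmd $openqa_job_id");
    $minion_job->note(hook_cmd => $hook_cmd, hook_rc => $rc);
    # the Minion job is never failed based on $rc, so it still shows
    # "Job successfully executed" and state "finished"
}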
Acceptance criteria
- AC1: Alerts are received if a high number (or ratio) of hook scripts fail
Suggestions
- Read lib/OpenQA/Task/Job/FinalizeResults.pm
- Put a postgres query like
select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour' order by finished;
into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L110 (see the sketch after this list)
- Add a corresponding Grafana panel with alert and description
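A sketch of how that could look in the telegraf config, assuming the file uses the inputs.postgresql_extensible plugin; the connection string and measurement name below are placeholders, not taken from the actual telegraf-webui.conf:
[[inputs.postgresql_extensible]]
  # placeholder connection string; reuse whatever telegraf-webui.conf already defines
  address = "host=/run/postgresql dbname=openqa sslmode=disable"
  [[inputs.postgresql_extensible.query]]
    # hypothetical measurement name for the Grafana panel/alert to query
    measurement = "minion_hook_failures"
    # count of minion jobs with a failing hook in the last 24h
    sqlquery = "select count(*) as failed from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour'"
    withdbname = false
    tagvalue = ""
A Grafana alert could then fire whenever "failed" exceeds a chosen threshold within the evaluation window.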
Related issues
History
#1
Updated by cdywan over 1 year ago
- Copied from action #99195: Upgrade o3 webUI host to openSUSE Leap 15.3 size:M added
#2
Updated by cdywan over 1 year ago
- Description updated (diff)
#3
Updated by okurz over 1 year ago
- Related to action #57239: Add/fix openqa_logwarn for o3 and osd sending to o3-admins@suse.de and osd-admins@suse.de respectively added
#4
Updated by okurz over 1 year ago
- Priority changed from High to Normal
- Target version changed from Ready to future
When talking about "o3", what do you mean by "alert"? We don't have Grafana there, if that is what you might be confusing it with. There is #57239 which could have helped. Taking the ticket out of the backlog again. We have 102 tickets in the backlog, which is too much.
#5
Updated by cdywan over 1 year ago
- Subject changed from Failed minion jobs for job hooks didn't cause any alerts to Minion jobs for job hooks failed silently on o3
- Description updated (diff)
When talking about "o3", what do you mean by "alert"? We don't have Grafana
My mistake. I didn't mean to prescribe an implementation. The point is that these are completely silent failures.
#6
Updated by tinita over 1 year ago
Just a note on how to find out about failing hook scripts manually, as we have the hook_rc now:
select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 order by created;
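For a manual check this could be run for example as follows, assuming the database is called openqa and local access via the postgres superuser is available on the webUI host:
sudo -u postgres psql openqa -c "select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 order by created;"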