action #99741
closed
Minion jobs for job hooks failed silently on o3 size:M
0%
Description
Minion jobs for job hooks failed silently on o3
Observation
Gru logs show entries like this for minion jobs:
Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.
Relevant minion jobs are shown as finished rather than "failed", e.g. https://openqa.opensuse.org/minion/jobs?id=800152 with the following details:
---
args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
  gru_id: 17752756
  hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
    /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
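The mismatch in the details above (state "finished" with a hook_rc of -1) can be checked mechanically. The following is a minimal sketch, assuming the job details are available as a dictionary shaped like the YAML dump; the function name is_silent_hook_failure is hypothetical and not part of openQA or Minion.

```python
def is_silent_hook_failure(job: dict) -> bool:
    """Return True if Minion reports the job as finished although the
    hook script recorded a non-zero exit code in the job notes."""
    notes = job.get("notes") or {}
    hook_rc = notes.get("hook_rc")
    if hook_rc is None:
        # No hook ran for this job (or the exit code was not recorded).
        return False
    return job.get("state") == "finished" and int(hook_rc) != 0


# Shaped like the job details above (only the relevant keys are kept).
job = {
    "id": 800152,
    "state": "finished",
    "task": "finalize_job_results",
    "notes": {"gru_id": 17752756, "hook_rc": -1},
}
print(is_silent_hook_failure(job))
```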
Acceptance criteria
- AC1: Alerts are received for both osd+o3 if a high (configurable?) amount (or ratio) of hook scripts fail
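One possible reading of AC1 is a ratio check with a configurable threshold. This is only a sketch: the function name, the default ratio of 0.1 and the minimum sample size are assumptions made for illustration, not values agreed on in this ticket.

```python
def hook_failure_alert(failed: int, total: int,
                       max_ratio: float = 0.1, min_total: int = 10) -> bool:
    """Return True if the share of failed hook scripts exceeds max_ratio.

    min_total avoids alerting on a handful of jobs, where a single
    failure would already be a large ratio. Both limits are assumptions.
    """
    if total < min_total:
        return False
    return failed / total > max_ratio


print(hook_failure_alert(failed=5, total=20))
print(hook_failure_alert(failed=1, total=20))
```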
Suggestions
- Read lib/OpenQA/Task/Job/FinalizeResults.pm
- Put a postgres query like
  select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour' order by finished;
  into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L110
- Don't query too frequently (once an hour maybe), since any query will have to be rather inefficient
- Consider a solution for o3 as well, e.g. error in log that openqa_logwarn would alert us about
- Optional: Add an according grafana panel with alert and description
Updated by livdywan almost 3 years ago
- Copied from action #99195: Upgrade o3 webUI host to openSUSE Leap 15.3 size:M added
Updated by okurz almost 3 years ago
- Related to action #57239: Add/fix openqa_logwarn for o3 and osd sending to o3-admins@suse.de and osd-admins@suse.de respectively added
Updated by okurz almost 3 years ago
- Priority changed from High to Normal
- Target version changed from Ready to future
When talking about "o3", what do you mean by "alert"? We don't have grafana there, in case that is what you had in mind. #57239 could have helped here. Taking the ticket out of the backlog again: we have 102 tickets in the backlog, which is too many.
Updated by livdywan almost 3 years ago
- Subject changed from Failed minion jobs for job hooks didn't cause any alerts to Minion jobs for job hooks failed silently on o3
- Description updated (diff)
When talking about "o3", what do you mean by "alert"? We don't have grafana
My mistake. I didn't mean to prescribe an implementation. The point is that these are completely silent failures.
Updated by tinita almost 3 years ago
Just a note on how to find out about failing hook scripts manually, as we have the hook_rc now:
select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 order by created;
Updated by okurz over 1 year ago
- Related to action #128405: Missing investigate jobs on both o3+osd since months? size:M added
Updated by kraih over 1 year ago
tinita wrote:
Just a note on how to find out about failing hook scripts manually, as we have the hook_rc now:
select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 order by created;
But be aware that this query would use a sequential scan, so maybe don't run it too frequently:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Sort (cost=8101.87..8203.47 rows=40638 width=146) (actual time=52.812..53.434 rows=2092 loops=1)
Sort Key: created
Sort Method: quicksort Memory: 2221kB
-> Seq Scan on minion_jobs (cost=0.00..4990.92 rows=40638 width=146) (actual time=0.061..47.925 rows=2092 loops=1)
Filter: (((notes -> 'hook_rc'::text))::integer <> 0)
Rows Removed by Filter: 39143
Planning Time: 0.156 ms
Execution Time: 53.651 ms
(8 rows)
Updated by okurz over 1 year ago
- Subject changed from Minion jobs for job hooks failed silently on o3 to Minion jobs for job hooks failed silently on o3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan over 1 year ago
btw this is the "telegraf" route https://openqa.opensuse.org/admin/influxdb/minion
Updated by dheidler over 1 year ago
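use DateTime;
my $end = DateTime->now;
# Round down to the previous 10-minute boundary
$end->subtract(minutes => ($end->minute() % 10), seconds => $end->second());
# The window covers the 10 minutes before that boundary
my $start = $end->clone()->subtract(minutes => 10);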
Updated by tinita over 1 year ago
For the record, there is also DateTime->now->truncate( to => 'minute' ) to cut off the seconds, but if you are subtracting minutes there, it makes sense to subtract the seconds in the same call.
Also I think the query can be further reduced to hook_script tasks, e.g. where task = 'hook_script'
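The window calculation and the narrowed query could be combined roughly as follows. This is a sketch under assumptions: last_full_window is a hypothetical helper mirroring the Perl snippet above, the %(name)s placeholders assume a psycopg-style driver, and the query is only built here, not executed.

```python
from datetime import datetime, timedelta, timezone


def last_full_window(now: datetime, minutes: int = 10):
    """Return (start, end) of the last complete window before `now`,
    with `end` rounded down to a multiple of `minutes` past the hour."""
    end = now.replace(second=0, microsecond=0)
    end -= timedelta(minutes=end.minute % minutes)
    return end - timedelta(minutes=minutes), end


# Counts hook failures in one window, restricted to a single task.
# Placeholder style is an assumption (psycopg named parameters).
QUERY = """
select count(*) from minion_jobs
where task = %(task)s
  and cast(notes->'hook_rc' as int) != 0
  and finished >= %(start)s and finished < %(end)s
"""

start, end = last_full_window(datetime.now(timezone.utc))
params = {"task": "hook_script", "start": start, "end": end}
print(params)
```

Using a half-open interval (>= start, < end) keeps consecutive windows from counting the same job twice.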
Updated by dheidler over 1 year ago
- Status changed from Workable to Feedback
Updated by dheidler over 1 year ago
2do after merge: add graph in grafana including alert.
Updated by tinita over 1 year ago
For the record: I looked into the current gru journal on o3. There are some recent grep timeouts, however, no hook scripts are marked with a nonzero exit code in the minion database.
I looked into the code to check the logic again, and currently if the search_log() call fails, the script is just returning and forgetting about the search_log exit code: https://github.com/os-autoinst/scripts/blob/master/_common#L106
We might want to change that at some point by checking on the return code in label-on-issue itself. Only the exit codes 0 (grep was successful) and 1 (grep didn't find something) are expected.
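The exit code handling described above could look roughly like this. It is a sketch in Python rather than the shell used by the actual scripts, and classify_grep_rc is a hypothetical name; it only illustrates treating 0 and 1 as expected and everything else as an error.

```python
def classify_grep_rc(rc: int) -> str:
    """Map a grep exit code to an outcome.

    0 means a match was found, 1 means no match; any other value
    (for example 2 for a grep error, or the exit code of a wrapper
    that killed grep on timeout) is unexpected and should not be ignored.
    """
    if rc == 0:
        return "match"
    if rc == 1:
        return "no-match"
    return "error"


for rc in (0, 1, 2):
    print(rc, classify_grep_rc(rc))
```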
Updated by livdywan over 1 year ago
Let's wait for the deployment so we have some data to test the alert itself to be implemented.
Updated by okurz over 1 year ago
dheidler wrote:
2do after merge: add graph in grafana including alert.
that does not make any sense as there is no grafana for o3
Updated by livdywan over 1 year ago
okurz wrote:
dheidler wrote:
2do after merge: add graph in grafana including alert.
that does not make any sense as there is no grafana for o3
The ACs and suggestions include both o3 and osd.
cdywan wrote:
btw this is the "telegraf" route https://openqa.opensuse.org/admin/influxdb/minion
Apparently the changes are effective:
openqa_minion_jobs,url=https://openqa.opensuse.org active=1i,delayed=0i,failed=352i,inactive=0i
openqa_minion_jobs_hook_rc_failed,url=https://openqa.opensuse.org rc_failed_per_10min=0i 16860372000000000
openqa_minion_workers,url=https://openqa.opensuse.org active=1i,inactive=0i,registered=1i
The route also works on osd:
openqa_minion_jobs,url=https://openqa.suse.de active=0i,delayed=0i,failed=3i,inactive=0i
openqa_minion_jobs_hook_rc_failed,url=https://openqa.suse.de rc_failed_per_10min=0i 16860372000000000
openqa_minion_workers,url=https://openqa.suse.de active=0i,inactive=1i,registered=1i
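To check such output in a script, the lines can be parsed as follows. This is a minimal sketch that only handles the subset of the InfluxDB line protocol seen above (no escaped spaces or commas, no string fields); parse_influx_line is a hypothetical helper.

```python
def parse_influx_line(line: str):
    """Parse 'measurement,tag=value field=1i,other=2i [timestamp]'.

    Returns (measurement, tags, fields, timestamp). Integer fields carry
    an 'i' suffix in the line protocol; other values are read as floats.
    """
    parts = line.strip().split(" ")
    measurement, *tag_items = parts[0].split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in parts[1].split(","):
        key, value = item.split("=", 1)
        fields[key] = int(value[:-1]) if value.endswith("i") else float(value)
    timestamp = int(parts[2]) if len(parts) > 2 else None
    return measurement, tags, fields, timestamp


line = ("openqa_minion_jobs_hook_rc_failed,url=https://openqa.opensuse.org "
        "rc_failed_per_10min=0i 16860372000000000")
measurement, tags, fields, timestamp = parse_influx_line(line)
print(measurement, fields["rc_failed_per_10min"])
```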
Updated by dheidler over 1 year ago
Updated by livdywan over 1 year ago
Discussed in the Unblock:
Why don't we use logwarn?
- Do we always want to be alerted vs. a threshold?
- Gru logs into the journal. That's not visible to logwarn, is it?
- What about grafana on o3? There seems to be an instance that we weren't aware of (metrics.opensuse.org)
Why do we care to have more advanced monitoring on o3 now?
- We have Dimstar
- Dimstar has no plugin to monitor investigate jobs
- Fixing things up as needed based on user feedback is considered okay for o3
- Back to basics - simple options
- Forget about the ratio for o3
- Get logwarn to pick up logs from the journal OR move logging to files?
- Try and write a munin plugin - this could be a super simple bash script written in minutes (example plugin for minion_jobs)
- Sending emails from o3 is already possible and used e.g. by logwarn called from cron, see /etc/cron.d/logwarn_openqa: cron simply sends any stdout by email to o3-admins@suse.de
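For illustration, the munin option could be sketched like this. A munin plugin prints graph metadata when called with the argument "config" and the current value otherwise. The field name hook_failed, the graph texts and the thresholds are assumptions, and how the failure count is obtained is left out; this is not the plugin that was eventually deployed.

```python
def munin_output(args, failed_count, warning=1, critical=5):
    """Return what a munin plugin would print for the given arguments.

    args mimics sys.argv: munin calls the plugin with 'config' to get
    the graph definition and without arguments to fetch the value.
    """
    if len(args) > 1 and args[1] == "config":
        return "\n".join([
            "graph_title Failed openQA hook scripts",
            "graph_vlabel failed hooks",
            "graph_category openqa",
            "hook_failed.label hook_rc != 0",
            # munin raises a warning/critical state above these values
            f"hook_failed.warning {warning}",
            f"hook_failed.critical {critical}",
        ])
    return f"hook_failed.value {failed_count}"


print(munin_output(["plugin", "config"], failed_count=0))
print(munin_output(["plugin"], failed_count=3))
```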
Updated by livdywan about 1 year ago
cdywan wrote:
- Try and write a munin plugin - this could be a super simple bash script written in minutes ([example plugin for minion_jobs](https://gist.github.com/perlpunk/4aec2d85c51b22261c0717afc896aa4c))
Updated by tinita about 1 year ago
I'm fighting with getting alert emails to work with my local munin. The documentation about that is a bit confusing, and while I've written various munin plugins, I never configured email notifications myself.
Updated by dheidler about 1 year ago
Adding grafana chart: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/882
Updated by tinita about 1 year ago
- Copied to action #130778: Treat some openqa-clone-job failures non-fatal in openqa-investigate hook added
Updated by dheidler about 1 year ago
Updated by tinita about 1 year ago
I installed the munin plugin on o3, but for now it points to a clone under my personal user, because the openQA zypper package doesn't contain the contrib folder.
Create the tunnel with
ssh -L 8080:localhost:80 o3
and view the data:
http://127.0.0.1:8080/munin/minion-day.html
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5210 munin: Make alert thresholds configurable
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5215 Add subpackage openQA-munin
Updated by tinita about 1 year ago
Installed openQA-munin package on o3. For configuration see /etc/munin/plugin-conf.d/openqa-minion
Updated by okurz about 1 year ago
- Status changed from Feedback to Resolved
We have grafana for OSD and munin with email sending on o3. We assume we are good now.
Updated by jbaier_cz about 1 year ago
- Related to action #132665: [alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S added