action #99741

closed

Minion jobs for job hooks failed silently on o3 size:M

Added by livdywan about 3 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-10-04
Due date:
% Done: 0%
Estimated time:

Description

Minion jobs for job hooks failed silently on o3

Observation

Gru logs show entries like this for minion jobs:

Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.

Relevant minion jobs are shown as finished rather than "failed", e.g. https://openqa.opensuse.org/minion/jobs?id=800152 with the following details:

---
args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
  gru_id: 17752756
  hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
    /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
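
The mismatch shown above (state "finished" together with hook_rc: -1) can be checked mechanically. The following is a minimal Python sketch, assuming the job details have already been loaded into dictionaries shaped like the YAML above. The function name is_silent_hook_failure is hypothetical and not part of openQA or Minion.

```python
def is_silent_hook_failure(job: dict) -> bool:
    """Return True if a Minion job is 'finished' but its hook script exited non-zero."""
    notes = job.get("notes") or {}
    hook_rc = notes.get("hook_rc")
    # Jobs without a hook_rc note did not record a hook script result at all.
    if hook_rc is None:
        return False
    return job.get("state") == "finished" and int(hook_rc) != 0


# Fields copied from the job details above.
job = {
    "id": 800152,
    "state": "finished",
    "task": "finalize_job_results",
    "notes": {"gru_id": 17752756, "hook_rc": -1},
}
print(is_silent_hook_failure(job))
```

Jobs that Minion itself marks as "failed" are deliberately not counted here, since those are already visible in the Minion dashboard.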

Acceptance criteria

  • AC1: Alerts are received for both osd and o3 if a high (configurable?) number (or ratio) of hook scripts fail
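
One way to read AC1 as a ratio check is sketched below in Python. The function name should_alert and the default values for max_ratio and min_jobs are assumptions made for illustration only; the ticket does not prescribe a threshold.

```python
def should_alert(failed: int, total: int, max_ratio: float = 0.1, min_jobs: int = 10) -> bool:
    """Decide whether the share of failed hook scripts warrants an alert.

    max_ratio and min_jobs are hypothetical, configurable thresholds.
    """
    # Too few jobs in the interval: a single failure would dominate the ratio.
    if total < min_jobs:
        return False
    return failed / total > max_ratio


print(should_alert(failed=5, total=20))
print(should_alert(failed=1, total=20))
```

The min_jobs guard is a design choice to avoid alerts during quiet periods, where one failed hook out of two jobs would otherwise count as a 50% failure rate.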

Suggestions


Related issues 5 (2 open, 3 closed)

Related to openQA Infrastructure - action #57239: Add/fix openqa_logwarn for o3 and osd sending to o3-admins@suse.de and osd-admins@suse.de respectively (Workable, 2019-09-23)
Related to openQA Project - action #128405: Missing investigate jobs on both o3+osd since months? size:M (Resolved, tinita, 2023-04-28)
Related to openQA Project - action #132665: [alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S (Resolved, tinita, 2023-07-13)
Copied from openQA Infrastructure - action #99195: Upgrade o3 webUI host to openSUSE Leap 15.3 size:M (Resolved, livdywan)
Copied to openQA Infrastructure - action #130778: Treat some openqa-clone-job failures non-fatal in openqa-investigate hook (New)

Actions #1

Updated by livdywan about 3 years ago

  • Copied from action #99195: Upgrade o3 webUI host to openSUSE Leap 15.3 size:M added
Actions #2

Updated by livdywan about 3 years ago

  • Description updated (diff)
Actions #3

Updated by okurz about 3 years ago

  • Related to action #57239: Add/fix openqa_logwarn for o3 and osd sending to o3-admins@suse.de and osd-admins@suse.de respectively added
Actions #4

Updated by okurz about 3 years ago

  • Priority changed from High to Normal
  • Target version changed from Ready to future

When talking about "o3", what do you mean by "alert"? We don't have grafana, if that is what you are confusing it with. #57239 could have helped here. Taking the ticket out of the backlog again: we have 102 tickets in the backlog, which is too many.

Actions #5

Updated by livdywan about 3 years ago

  • Subject changed from Failed minion jobs for job hooks didn't cause any alerts to Minion jobs for job hooks failed silently on o3
  • Description updated (diff)

When talking about "o3", what do you mean by "alert"? We don't have grafana

My mistake. I didn't mean to prescribe an implementation. The point is that these are completely silent failures.

Actions #6

Updated by tinita about 3 years ago

Just a note on how to find out about failing hook scripts manually, as we have the hook_rc now:

select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where cast(notes->'hook_rc' as int)  != 0 order by created;
Actions #7

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #8

Updated by okurz over 1 year ago

  • Related to action #128405: Missing investigate jobs on both o3+osd since months? size:M added
Actions #9

Updated by okurz over 1 year ago

  • Target version changed from future to Ready
Actions #10

Updated by kraih over 1 year ago

tinita wrote:

Just a note on how to find out about failing hook scripts manually, as we have the hook_rc now:

select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where cast(notes->'hook_rc' as int)  != 0 order by created;

But be aware that this query would use a sequential scan, so maybe don't run it too frequently:

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=8101.87..8203.47 rows=40638 width=146) (actual time=52.812..53.434 rows=2092 loops=1)
   Sort Key: created
   Sort Method: quicksort  Memory: 2221kB
   ->  Seq Scan on minion_jobs  (cost=0.00..4990.92 rows=40638 width=146) (actual time=0.061..47.925 rows=2092 loops=1)
         Filter: (((notes -> 'hook_rc'::text))::integer <> 0)
         Rows Removed by Filter: 39143
 Planning Time: 0.156 ms
 Execution Time: 53.651 ms
(8 rows)
Actions #11

Updated by okurz over 1 year ago

  • Subject changed from Minion jobs for job hooks failed silently on o3 to Minion jobs for job hooks failed silently on o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #12

Updated by livdywan over 1 year ago

btw this is the "telegraf" route https://openqa.opensuse.org/admin/influxdb/minion

Actions #13

Updated by dheidler over 1 year ago

  • Assignee set to dheidler
Actions #14

Updated by dheidler over 1 year ago

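use DateTime;
# Round "now" down to the last full 10-minute boundary, cutting off the seconds as well.
my $end = DateTime->now;
$end->subtract(minutes => ($end->minute % 10), seconds => $end->second);
# The reporting window covers the 10 minutes before that boundary.
my $start = $end->clone->subtract(minutes => 10);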
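
For readers less familiar with DateTime, the same window arithmetic can be sketched in Python. This is only an illustration of the rounding logic, not the code that was merged into openQA, and the function name ten_minute_window is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ten_minute_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the last complete 10-minute interval before `now`."""
    # Round down to the previous multiple of 10 minutes and drop seconds.
    end = now.replace(minute=now.minute - now.minute % 10, second=0, microsecond=0)
    start = end - timedelta(minutes=10)
    return start, end


start, end = ten_minute_window(datetime(2021, 10, 2, 13, 46, 14, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat())
```

For 13:46:14 this yields the window 13:30:00 to 13:40:00, so every job is counted in exactly one interval regardless of when the route is polled within a 10-minute slot.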
Actions #15

Updated by tinita over 1 year ago

for the record, there is also DateTime->now->truncate( to => 'minute' ) to cut off the seconds, but if you are subtracting minutes there, it makes sense to subtract the seconds in the same call.

Also I think the query can be further reduced to hook_script tasks, e.g. where task = 'hook_script'

Actions #16

Updated by dheidler over 1 year ago

  • Status changed from Workable to Feedback
Actions #17

Updated by dheidler over 1 year ago

2do after merge: add graph in grafana including alert.

Actions #18

Updated by tinita over 1 year ago

For the record: I looked into the current gru journal on o3. There are some recent grep timeouts, however, no hook scripts are marked with a nonzero exit code in the minion database.
I looked into the code to check the logic again, and currently if the search_log() call fails, the script is just returning and forgetting about the search_log exit code: https://github.com/os-autoinst/scripts/blob/master/_common#L106

We might want to change that at some point by checking on the return code in label-on-issue itself. Only the exit codes 0 (grep was successful) and 1 (grep didn't find something) are expected.
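
The distinction between expected and unexpected grep exit codes could look roughly like the following Python sketch. The hook scripts themselves are shell scripts, so this only illustrates the decision logic; the function name classify_grep_exit is hypothetical.

```python
def classify_grep_exit(rc: int) -> str:
    """Map a grep exit code to 'match', 'no-match' or 'error'."""
    if rc == 0:
        return "match"      # grep found the pattern
    if rc == 1:
        return "no-match"   # grep ran fine but found nothing
    # Anything else (e.g. 2 for a grep error, or a code set by a timeout
    # wrapper) is unexpected and should not be silently dropped.
    return "error"


for rc in (0, 1, 2):
    print(rc, classify_grep_exit(rc))
```

Only the "error" case would need to be propagated as a non-zero hook exit code so that it shows up in hook_rc.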

Actions #19

Updated by livdywan over 1 year ago

Let's wait for the deployment so we have some data to test the alert itself to be implemented.

Actions #20

Updated by okurz over 1 year ago

dheidler wrote:

2do after merge: add graph in grafana including alert.

that does not make any sense as there is no grafana for o3

Actions #21

Updated by livdywan over 1 year ago

okurz wrote:

dheidler wrote:

2do after merge: add graph in grafana including alert.

that does not make any sense as there is no grafana for o3

The ACs and suggestions include both o3 and osd.

cdywan wrote:

btw this is the "telegraf" route https://openqa.opensuse.org/admin/influxdb/minion

Apparently the changes are effective:

openqa_minion_jobs,url=https://openqa.opensuse.org active=1i,delayed=0i,failed=352i,inactive=0i
openqa_minion_jobs_hook_rc_failed,url=https://openqa.opensuse.org rc_failed_per_10min=0i 16860372000000000
openqa_minion_workers,url=https://openqa.opensuse.org active=1i,inactive=0i,registered=1i

The route also works on osd:

openqa_minion_jobs,url=https://openqa.suse.de active=0i,delayed=0i,failed=3i,inactive=0i
openqa_minion_jobs_hook_rc_failed,url=https://openqa.suse.de rc_failed_per_10min=0i 16860372000000000
openqa_minion_workers,url=https://openqa.suse.de active=0i,inactive=1i,registered=1i
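
To check such output by script, the lines can be split into measurement, tags, fields and timestamp. The Python sketch below is a simplified parser that assumes no escaped spaces or commas and only numeric field values, which holds for the lines shown above but not for InfluxDB line protocol in general. The function name parse_line is hypothetical.

```python
def parse_line(line: str) -> dict:
    """Parse one simple InfluxDB line-protocol line (no escaping, numeric fields only)."""
    parts = line.split()
    measurement, *tag_pairs = parts[0].split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {}
    for pair in parts[1].split(","):
        key, value = pair.split("=", 1)
        # Integer fields carry an "i" suffix in line protocol.
        fields[key] = int(value[:-1]) if value.endswith("i") else float(value)
    timestamp = int(parts[2]) if len(parts) > 2 else None
    return {"measurement": measurement, "tags": tags, "fields": fields, "timestamp": timestamp}


line = ("openqa_minion_jobs_hook_rc_failed,url=https://openqa.opensuse.org "
        "rc_failed_per_10min=0i 16860372000000000")
print(parse_line(line)["fields"]["rc_failed_per_10min"])
```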
Actions #23

Updated by livdywan over 1 year ago

Discussed in the Unblock:

  • Why don't we use logwarn?

    • Do we always want to be alerted vs. a threshold?
    • Gru logs into the journal. That's not visible to logwarn, is it?
    • What about grafana on o3? There seems to be an instance that we weren't aware of (metrics.opensuse.org)
  • Why do we care to have more advanced monitoring on o3 now?

    • We have Dimstar
    • Dimstar has no plugin to monitor investigate jobs
    • Fixing things up as needed based on user feedback is considered okay for o3
    • Back to basics - simple options
    • Forget about the ratio for o3
    • Get logwarn to pick up logs from the journal OR move logging to files?
    • Try and write a munin plugin - this could be a super simple bash script written in minutes (example plugin for minion_jobs)
    • Sending emails from o3 is already possible and used e.g. by logwarn called by cron, see /etc/cron.d/logwarn_openqa: cron simply sends any stdout by email to o3-admins@suse.de
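
To illustrate how small such a munin plugin can be: a plugin prints graph metadata when called with "config" and the current value otherwise. The Python sketch below only builds that text. It is not the plugin from the linked example; the field name hook_failed, the graph labels and the threshold defaults are all assumptions, and a real plugin would obtain the count from the minion database instead of taking it as a parameter.

```python
def munin_output(mode: str, failed_hooks: int, warning: int = 1, critical: int = 5) -> str:
    """Build the text a munin plugin prints for 'config' or for a normal fetch."""
    if mode == "config":
        return "\n".join([
            "graph_title Failed openQA hook scripts",   # hypothetical title
            "graph_vlabel failed hook scripts",
            "graph_category minion",                    # hypothetical category
            "hook_failed.label hook_rc != 0",
            f"hook_failed.warning {warning}",           # hypothetical thresholds
            f"hook_failed.critical {critical}",
        ])
    # Any other mode is treated as a fetch and reports the current value.
    return f"hook_failed.value {failed_hooks}"


print(munin_output("config", 0))
print(munin_output("fetch", 3))
```

With warning and critical limits set in the config output, munin can raise the alert itself, which avoids a separate threshold check in cron.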
Actions #24

Updated by livdywan over 1 year ago

cdywan wrote:

- Try and write a munin plugin - this could be a super simple bash script written in minutes ([example plugin for minion_jobs](https://gist.github.com/perlpunk/4aec2d85c51b22261c0717afc896aa4c))

https://github.com/os-autoinst/openQA/pull/5194

Actions #25

Updated by tinita over 1 year ago

I'm fighting with getting alert emails to work with my local munin. The documentation about that is a bit confusing, and while I've written various munin plugins, I never configured email notifications myself.

Actions #27

Updated by tinita over 1 year ago

  • Copied to action #130778: Treat some openqa-clone-job failures non-fatal in openqa-investigate hook added
Actions #29

Updated by tinita over 1 year ago

I installed the munin plugin on o3, but for now it points to a clone under my personal user, because the openQA zypper package doesn't contain the contrib folder.

Create the tunnel ssh -L 8080:localhost:80 o3 and view the data:
http://127.0.0.1:8080/munin/minion-day.html

Actions #30

Updated by tinita over 1 year ago

https://github.com/os-autoinst/openQA/pull/5210 munin: Make alert thresholds configurable

Actions #31

Updated by tinita over 1 year ago

Actions #32

Updated by tinita over 1 year ago

Installed openQA-munin package on o3. For configuration see /etc/munin/plugin-conf.d/openqa-minion

Actions #33

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

We have grafana for OSD and munin with email sending on o3. We assume we are good now.

Actions #34

Updated by jbaier_cz over 1 year ago

  • Related to action #132665: [alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S added