action #99741

Minion jobs for job hooks failed silently on o3

Added by cdywan over 1 year ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Target version:
Start date: 2021-10-04
Due date:
% Done: 0%
Estimated time:

Description

Observation

Gru logs show entries like this for minion jobs:

Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.

Relevant minion jobs are shown as finished rather than "failed", e.g. https://openqa.opensuse.org/minion/jobs?id=800152 with the following details:

---
args:
- 1951060
- ~
attempts: 1
children: []
created: 2021-10-02T13:46:14.14573Z
delayed: 2021-10-02T13:46:14.14573Z
expires: ~
finished: 2021-10-02T13:46:14.41935Z
id: 800152
lax: 0
notes:
  gru_id: 17752756
  hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
    /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: -1
parents: []
priority: -10
queue: default
result: Job successfully executed
retried: ~
retries: 0
started: 2021-10-02T13:46:14.15145Z
state: finished
task: finalize_job_results
time: 2021-10-04T13:59:22.27403Z
worker: 744
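
A note on reading hook_rc: -1 in the details above. The sketch below assumes that hook_rc holds the raw Perl child status ($?) of the hook command, which the ticket does not state explicitly. Under that assumption, -1 means the command could not be started at all, which matches the "Can't exec" log entry. The function name describe_hook_rc is hypothetical and only used for illustration.

```python
def describe_hook_rc(hook_rc: int) -> str:
    """Interpret a value in the style of Perl's $? (assumed meaning of hook_rc)."""
    if hook_rc == -1:
        # Perl sets $? to -1 when the command could not be executed at all.
        return "failed to execute"
    if hook_rc == 0:
        return "success"
    signal = hook_rc & 127  # low 7 bits: number of the terminating signal, if any
    if signal:
        return f"killed by signal {signal}"
    return f"exited with code {hook_rc >> 8}"  # high byte: exit code


if __name__ == "__main__":
    for rc in (-1, 0, 256, 9):
        print(rc, "->", describe_hook_rc(rc))
```

If the assumption holds, the minion job itself ran to completion and only recorded the failed hook in its notes, which would explain why its state is "finished".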

Acceptance criteria

  • AC1: Alerts are received if a high number (or ratio) of hook scripts fail
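
As a rough illustration of AC1, the following is a minimal sketch of a ratio-based check over recorded hook_rc values. It does not prescribe where the values come from or how an alert is delivered. The names hook_failure_ratio and should_alert as well as the thresholds max_ratio and min_samples are hypothetical choices, not taken from openQA.

```python
from typing import Iterable, List, Optional


def _known(hook_rcs: Iterable[Optional[int]]) -> List[int]:
    """Drop entries without a recorded hook_rc (None)."""
    return [rc for rc in hook_rcs if rc is not None]


def hook_failure_ratio(hook_rcs: Iterable[Optional[int]]) -> float:
    """Fraction of hook runs with a non-zero return code (0.0 if there are none)."""
    rcs = _known(hook_rcs)
    if not rcs:
        return 0.0
    return sum(1 for rc in rcs if rc != 0) / len(rcs)


def should_alert(hook_rcs: Iterable[Optional[int]],
                 max_ratio: float = 0.2,
                 min_samples: int = 10) -> bool:
    """Alert only if enough hooks ran and the failure ratio exceeds max_ratio."""
    rcs = _known(hook_rcs)
    if len(rcs) < min_samples:
        return False  # too few samples to judge a ratio
    return hook_failure_ratio(rcs) > max_ratio


if __name__ == "__main__":
    recent = [0] * 7 + [-1] * 3  # 3 of 10 hooks failed
    print(hook_failure_ratio(recent), should_alert(recent))
```

The minimum sample count is there so that a single failing hook on a quiet day does not trigger an alert on its own.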

Suggestions


Related issues

Related to openQA Infrastructure - action #57239: Add/fix openqa_logwarn for o3 and osd sending to o3-admins@suse.de and osd-admins@suse.de respectively (Workable, 2019-09-23)

Copied from openQA Infrastructure - action #99195: Upgrade o3 webUI host to openSUSE Leap 15.3 size:M (Resolved)

History

#1 Updated by cdywan over 1 year ago

  • Copied from action #99195: Upgrade o3 webUI host to openSUSE Leap 15.3 size:M added

#2 Updated by cdywan over 1 year ago

  • Description updated (diff)

#3 Updated by okurz over 1 year ago

  • Related to action #57239: Add/fix openqa_logwarn for o3 and osd sending to o3-admins@suse.de and osd-admins@suse.de respectively added

#4 Updated by okurz over 1 year ago

  • Priority changed from High to Normal
  • Target version changed from Ready to future

When talking about "o3", what do you mean by "alert"? We don't have Grafana there, if that is what you might be confusing it with. There is #57239, which could have helped. Taking the ticket out of the backlog again. We have 102 tickets in the backlog, which is too many.

#5 Updated by cdywan over 1 year ago

  • Subject changed from Failed minion jobs for job hooks didn't cause any alerts to Minion jobs for job hooks failed silently on o3
  • Description updated (diff)

When talking about "o3", what do you mean by "alert"? We don't have Grafana

My mistake. I didn't mean to prescribe an implementation. The point is that these are completely silent failures.

#6 Updated by tinita over 1 year ago

Just a note on how to find out about failing hook scripts manually, as we have the hook_rc now:

select id, args, notes->'hook_rc', notes->'hook_result', created, finished
  from minion_jobs
 where cast(notes->'hook_rc' as int) != 0
 order by created;

#7 Updated by okurz 11 months ago

  • Description updated (diff)
