action #132665

closed

coordination #102915: [saga][epic] Automated classification of failures

coordination #166655: [epic] openqa-label-known-issues

[alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S

Added by jbaier_cz over 1 year ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-07-13
Due date:
% Done: 0%
Estimated time:
Description

Observation

Munin reported multiple minion hook failures on o3 between 2023-07-12T14:00:00Z and 2023-07-12T16:00:00Z; the corresponding minion jobs are in the "finished" state (not "failed").

Example of such job: https://openqa.opensuse.org/minion/jobs?id=2690714

---
args:
- env from_email=o3-admins@suse.de scheme=http enable_force_result=true email_unreviewed=true
  exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
- 3422493
- delay: 60
  kill_timeout: 30s
  retries: 1440
  skip_rc: 142
  timeout: 5m
attempts: 1
children: []
created: 2023-07-12T14:56:55.497617Z
delayed: 2023-07-12T14:56:55.497617Z
expires: ~
finished: 2023-07-12T14:57:00.532609Z
id: 2690714
lax: 0
notes:
  hook_cmd: env from_email=o3-admins@suse.de scheme=http enable_force_result=true
    email_unreviewed=true exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
    /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: 1
  hook_result: ''
parents: []
priority: 0
queue: default
result: ~
retried: ~
retries: 0
started: 2023-07-12T14:56:55.503134Z
state: finished
task: hook_script
time: 2023-07-13T08:52:43.144616Z
worker: 1674

Suggestions

  • Extend the code so it always includes an error message or a pointer to where the full error can be found (see the sketch after this list)
  • Investigate the underlying issue
  • Maybe just temporary?
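A minimal bash sketch of what the first suggestion could look like. It is only an illustration: the trap and the wording of the message are assumptions, loosely modelled on the "ERROR: line N" output that _common already prints to the journal (see the excerpt further below), and this is not the actual os-autoinst-scripts code.

#!/bin/bash
# Hypothetical sketch, not the real hook script: make sure every failure
# produces at least a human-readable hint about where to look next.
set -euo pipefail

on_error() {
    local line=$1
    # Emit the failing line number plus a pointer to the full log so the
    # reader of the hook output is never left with an empty result.
    echo "ERROR: line $line - see 'journalctl -u openqa-gru' for details" >&2
}
trap 'on_error "$LINENO"' ERR

# ... the actual hook logic would follow here ...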

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #99741: Minion jobs for job hooks failed silently on o3 size:M (Resolved, dheidler, 2021-10-04)

Actions #1

Updated by jbaier_cz over 1 year ago

  • Related to action #99741: Minion jobs for job hooks failed silently on o3 size:M added
Actions #2

Updated by livdywan over 1 year ago

  • Subject changed from [alert] Minion hook failed on o3 to [alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by tinita over 1 year ago

  • Category set to Regressions/Crashes
  • Status changed from Workable to In Progress
  • Assignee set to tinita

For failing minion hooks, the log to look at is the gru journal:

% journalctl --since=2023-07-12 -u openqa-gru
...
Jul 12 14:16:27 ariel openqa-gru[20052]: /opt/os-autoinst-scripts/_common: ERROR: line 77
Jul 12 14:16:27 ariel openqa-gru[20050]: /opt/os-autoinst-scripts/_common: ERROR: line 77
Jul 12 14:16:27 ariel openqa-gru[20050]: curl (152 /opt/os-autoinst-scripts/openqa-label-known-issues): Error fetching (--user-agent openqa-label-known-issues -s https://progress.opensuse.org/projects/openqav3/issues.json?limit=200&subproject_id=*&subject=~auto_review%3A):
Jul 12 14:16:27 ariel openqa-gru[20050]: 000
Jul 12 14:16:27 ariel openqa-gru[20049]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 152
Jul 12 14:16:27 ariel openqa-gru[20044]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 152
...
The same errors repeat multiple times until 16:06:33.

There was a long period during which progress.opensuse.org was apparently not reachable. The 000 indicates that; normally the HTTP status code would appear in its place.
I think that part can be improved by using curl -s -S instead of curl -s, so that curl prints an error message in such cases.
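A quick illustration of the difference, using the reserved .invalid TLD so the request can never succeed; this is just a sketch, not the actual command from openqa-label-known-issues:

# -s silences everything, so on a connection failure only the "000"
# placeholder from --write-out remains:
curl -s -o /dev/null -w '%{http_code}\n' http://unreachable.invalid
# prints: 000

# -s -S (--show-error) keeps the progress meter off but still prints the
# underlying error to stderr:
curl -sS -o /dev/null -w '%{http_code}\n' http://unreachable.invalid
# prints: 000
# curl: (6) Could not resolve host: unreachable.invalid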

Actions #4

Updated by tinita over 1 year ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/scripts/pull/248 - Let curl output an error message on error

I think in such cases we have to live with an alert, although it's nothing on our side that we could have fixed. progress.opensuse.org simply was not reachable for almost 2 hours, and a retry wouldn't have helped.

However, we do get a 503 from p.o.o quite regularly (about once a day), probably while it is being maintained, and I think at least a retry with an appropriately long delay could make sense for that case.
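For illustration, a hedged sketch of what such a retry could look like on the curl side; the retry count and delay are placeholder values, and the URL and user agent are taken from the journal excerpt above, not from the actual script:

# Sketch only, not the real openqa-label-known-issues invocation.
# curl treats HTTP 503 (among others) as a transient error when --retry is
# given; here it retries up to 5 times with a fixed 60 second pause.
curl -sS --retry 5 --retry-delay 60 \
    --user-agent openqa-label-known-issues \
    "https://progress.opensuse.org/projects/openqav3/issues.json?limit=200&subproject_id=*&subject=~auto_review%3A"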

Regarding the issue that apparently not everyone knew which error log to look at: I'm not sure how to solve that.
Putting something like hook_stderr: look-at-gru-journal into the minion job data maybe seems a bit silly? :)

Actions #5

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/scripts/pull/248 merged. As suggested in the daily, I would not invest any more effort here. When the reason for a failure is not clear, people just need to take a look into the gru logs. That's just how it is :)

Actions #6

Updated by livdywan over 1 year ago

okurz wrote:

https://github.com/os-autoinst/scripts/pull/248 merged. As suggested in the daily, I would not invest any more effort here. When the reason for a failure is not clear, people just need to take a look into the gru logs. That's just how it is :)

Actually we discussed it together, so you don't need to play the bad cop here ;-) I was voicing concerns with regard to handling alerts, but the curl change is already helpful as-is. We can and should take small steps and re-evaluate. So all good. And thank you Tina for raising it before implementing super fancy log parsing.

Actions #7

Updated by okurz about 2 months ago

  • Parent task set to #166655
