action #132665
closed
coordination #102915: [saga][epic] Automated classification of failures
coordination #166655: [epic] openqa-label-known-issues
[alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S
Added by jbaier_cz over 1 year ago.
Updated 2 months ago.
Category:
Regressions/Crashes
Description
Observation
Munin reported multiple minion hook failures on o3 between 2023-07-12T14:00:00Z and 2023-07-12T16:00:00Z. The relevant minion jobs are finished (not failed).
Example of such job: https://openqa.opensuse.org/minion/jobs?id=2690714
---
args:
- env from_email=o3-admins@suse.de scheme=http enable_force_result=true email_unreviewed=true
  exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
- 3422493
- delay: 60
  kill_timeout: 30s
  retries: 1440
  skip_rc: 142
  timeout: 5m
attempts: 1
children: []
created: 2023-07-12T14:56:55.497617Z
delayed: 2023-07-12T14:56:55.497617Z
expires: ~
finished: 2023-07-12T14:57:00.532609Z
id: 2690714
lax: 0
notes:
  hook_cmd: env from_email=o3-admins@suse.de scheme=http enable_force_result=true
    email_unreviewed=true exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
    /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: 1
  hook_result: ''
parents: []
priority: 0
queue: default
result: ~
retried: ~
retries: 0
started: 2023-07-12T14:56:55.503134Z
state: finished
task: hook_script
time: 2023-07-13T08:52:43.144616Z
worker: 1674
Suggestions
- Extend the code to always include an error message, or a link to where more details can be found
- Investigate the underlying issue
- Maybe it is only temporary?
- Related to action #99741: Minion jobs for job hooks failed silently on o3 size:M added
- Subject changed from [alert] Minion hook failed on o3 to [alert] openqa-label-known-issues-and-investigate minion hook failed on o3 size:S
- Description updated (diff)
- Status changed from New to Workable
- Category set to Regressions/Crashes
- Status changed from Workable to In Progress
- Assignee set to tinita
For failing minion hooks, the log to look at is the gru journal:
% journalctl --since=2023-07-12 -u openqa-gru
...
Jul 12 14:16:27 ariel openqa-gru[20052]: /opt/os-autoinst-scripts/_common: ERROR: line 77
Jul 12 14:16:27 ariel openqa-gru[20050]: /opt/os-autoinst-scripts/_common: ERROR: line 77
Jul 12 14:16:27 ariel openqa-gru[20050]: curl (152 /opt/os-autoinst-scripts/openqa-label-known-issues): Error fetching (--user-agent openqa-label-known-issues -s https://progress.opensuse.org/projects/openqav3/issues.json?limit=200&subproject_id=*&subject=~auto_review%3A):
Jul 12 14:16:27 ariel openqa-gru[20050]: 000
Jul 12 14:16:27 ariel openqa-gru[20049]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 152
Jul 12 14:16:27 ariel openqa-gru[20044]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 152
...
This repeated multiple times, until 16:06:33.
There was a long period during which progress.opensuse.org was apparently not reachable. The 000 indicates that; normally the HTTP status code would appear there.
I think that part can be improved by using curl -s -S instead of curl -s, so that we at least get an error message in such cases.
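For illustration, the difference can be reproduced locally with a URL that is guaranteed to fail (the file:// path below is a made-up stand-in; no network access is needed):

```shell
#!/bin/sh
# -s (--silent) suppresses both the progress meter and error messages;
# adding -S (--show-error) brings the error message back while keeping
# everything else quiet. The file:// URL is a made-up stand-in that
# always fails to open.
url="file:///nonexistent-example"

# With -s alone the failure is completely silent; only the exit code tells:
curl -s "$url" || true

# With -s -S the same failure prints a curl error message to stderr:
curl -sS "$url" || true
```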
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/scripts/pull/248 - Let curl output an error message on error
I think in such cases we have to live with an alert, even though there was nothing on our side we could have fixed: progress.opensuse.org was simply unreachable for almost two hours, and a retry would not have helped.
However, we do get a 503 for p.o.o quite regularly (roughly once a day), probably during maintenance, and I think at least a retry with an appropriately long delay could make sense for that.
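Such a retry could be sketched roughly as follows. This is a hypothetical wrapper, not the actual hook code; fetch_with_retry and the FETCH_CMD override are made-up names, the latter existing only so the sketch can be exercised without network access:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry with a long delay on transient errors only
# (503 from maintenance, 000 from connection failures).
fetch_with_retry() {
    local url=$1 tries=${2:-3} delay=${3:-300}
    local status i
    for ((i = 1; i <= tries; i++)); do
        # curl -w '%{http_code}' prints 000 when no HTTP response was received
        status=$("${FETCH_CMD:-curl}" -sS -o /dev/null -w '%{http_code}' "$url") || status=000
        case $status in
            2??) return 0 ;;   # success
            503 | 000) ;;      # transient: fall through and retry after the delay
            *) return 1 ;;     # anything else is treated as permanent
        esac
        ((i < tries)) && sleep "$delay"
    done
    return 1
}
```

Note that curl itself offers --retry and --retry-delay, and its built-in retry logic already treats 503 as transient, so the simpler cases might not need a wrapper at all.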
About the issue that it was apparently not known by everyone which error log to look at - I'm not sure how to solve that.
Putting something like hook_stderr: look-at-gru-journal
into the minion data seems a bit silly maybe? :)
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/scripts/pull/248 merged. As suggested in the daily I suggest to actually not invest any more effort here. People just need to take a look into the gru logs on problems where the reason of failure is not clear. That's just how it is :)
okurz wrote:
https://github.com/os-autoinst/scripts/pull/248 merged. As suggested in the daily I suggest to actually not invest any more effort here. People just need to take a look into the gru logs on problems where the reason of failure is not clear. That's just how it is :)
Actually we discussed it together so you don't need to play the bad cop here ;-) I was voicing concerns with regards to handling alerts but the curl change is already helpful as-is. We can and should take small steps and re-evaluate. So all good. And thank you Tina for raising it before implementing super fancy log parsing.
- Parent task set to #166655