action #165716
coordination #102915: [saga][epic] Automated classification of failures
coordination #166655: [epic] openqa-label-known-issues
[o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M
Description
Observation
We got an alert for o3:
opensuse.org :: openqa.opensuse.org :: hook failed - see openqa-gru service logs for details
WARNINGs: rc_failed_per_5min is 8.00 (outside range [:5]).
Here are the problematic lines in the journal:
sudo journalctl -u openqa-gru --since '2024-08-23'
Aug 23 00:03:23 ariel systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Aug 23 00:03:25 ariel systemd[1]: openqa-gru.service: Deactivated successfully.
Aug 23 00:03:25 ariel systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Aug 23 00:03:25 ariel systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
Aug 23 08:03:03 ariel openqa-gru[18277]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:03:12 ariel openqa-gru[18715]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:03:26 ariel openqa-gru[19152]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:03:44 ariel openqa-gru[19454]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:06 ariel openqa-gru[19770]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:22 ariel openqa-gru[20283]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:35 ariel openqa-gru[20569]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:40 ariel openqa-gru[20836]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:05:43 ariel openqa-gru[22016]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:06:45 ariel openqa-gru[24067]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
I can find the corresponding minion entries. They have a hook_rc of 1, but unfortunately no useful output.
https://openqa.opensuse.org/minion/jobs?id=4223339
https://openqa.opensuse.org/minion/jobs?id=4223321
https://openqa.opensuse.org/minion/jobs?id=4223308
We also have a few of those errors on osd.
The first error I can find on o3 is from August 18:
Aug 18 01:30:04 ariel openqa-gru[31244]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
For osd it's August 16:
Aug 16 11:57:21 openqa openqa-gru[7266]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
More detail
investigate_issue
We write autoinst-log.txt and reason into the same file.
If we successfully got autoinst-log.txt (http 200 or 301), we continue with trying to label the test. DONE.
No autoinst-log.txt
- If we couldn't fetch autoinst-log.txt, check if there is a general issue. handle_unreachable performs various tests and should return non-zero to indicate that we shouldn't go on trying to label the test.
- If the http status was not 404, don't continue with labeling.
- If the job is too old, don't continue with labeling.
- If there is no reason as well, don't continue with labeling.
Only if the http status is 404, the job is not too old and the reason is set, continue with labeling.
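The rules above can be sketched as a single decision function. This is only an illustration, not the code in openqa-label-known-issues: the function name should_continue_labeling and its three arguments are hypothetical, and the separate checks done by handle_unreachable are folded into one place for brevity.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the real script).
# Returns 0 if labeling should proceed, 1 otherwise.
#   $1  HTTP status received when fetching autoinst-log.txt
#   $2  "1" if the job is considered too old, "0" otherwise
#   $3  the job's reason (may be empty)
should_continue_labeling() {
    local http_status=$1 too_old=$2 reason=$3
    case $http_status in
        200 | 301) return 0 ;; # log was fetched, go on with labeling
        404) ;;                # log is missing, look at the other conditions
        *) return 1 ;;         # any other status: do not label
    esac
    [[ $too_old == 1 ]] && return 1 # job too old: do not label
    [[ -z $reason ]] && return 1    # no reason either: nothing to match on
    return 0                        # 404, recent job, reason set: label
}

# Example calls covering each rule
for args in "200 0 ''" "404 0 'backend died'" "404 1 'backend died'" "404 0 ''" "500 0 'backend died'"; do
    eval "set -- $args"
    if should_continue_labeling "$@"; then
        echo "status=$1 too_old=$2 reason='$3': continue with labeling"
    else
        echo "status=$1 too_old=$2 reason='$3': stop"
    fi
done
```

Keeping the decision in one function that only returns a status makes it straightforward to cover each rule with a test, which is what AC2 asks for.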
Acceptance Criteria
- AC1: Hook script does not abort when label_on_issues_from_issue_tracker returns non-zero
- AC2: The relevant part of the script is tested
- AC3: Behaviour from before the previous ticket/PR is reinstated
Updated by tinita 3 months ago
The dates point to https://github.com/os-autoinst/scripts/pull/335 as the culprit, but it is hard to tell what the problem is, as line 68 is really just a function header.
Updated by tinita 3 months ago
- Related to action #164296: openqa-label-known-issues does not look at known issues if autoinst-log.txt does not exist but reason could be looked at size:S added
Updated by ybonatakis 3 months ago
tinita wrote in #note-7:
It's reproducible with
./openqa-label-known-issues https://openqa.opensuse.org/tests/4425222
Calling the script with the old code still fails but silently
❯ dry_run=1 ./openqa-label-known-issues https://openqa.opensuse.org/tests/4425222
Requesting jobs/4425222 via openqa-cli
~/Documents/Work/qatools/repos/scripts on add_requirements_to_run_script *1 ?2
❯ echo $?
127
What is "special" about https://openqa.opensuse.org/tests/4425222 is that the CASEDIR uses an absolute tree branch, which I think openQA doesn't support, as it uses the #mybranch shortcut.
Updated by tinita 3 months ago
In the existing (previous) code we have elif label_on_issues_from_issue_tracker "$id"; then
The function label_on_issues_from_issue_tracker is currently expected to fail, because it just calls label-on-issue, which passes if it wrote a comment but fails if it didn't; that failure doesn't indicate anything fatal. In that case the script would try the next elif branch.
Because of the elif ... then, the error is caught.
But the newly added line above calls label_on_issues_from_issue_tracker without an if/elif or || something, so that's why it's aborting the script.
I guess if label_on_issues_from_issue_tracker fails in the new code, we still want to go through the rest of the code, e.g. call label_on_issues_without_tickets or handle_unreviewed.
So the code needs to be rearranged a bit.
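To illustrate the difference described here, a minimal sketch follows. It assumes the script runs with errexit (set -e) enabled. The three function names are taken from this ticket, but their bodies are hypothetical stand-ins, and label_guarded and label_unguarded are invented for the demonstration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in: returns 1 when no ticket matched, which is an
# expected and non-fatal outcome.
label_on_issues_from_issue_tracker() { return 1; }
# Hypothetical stand-in for the fallback step.
label_on_issues_without_tickets() { echo "labeled without ticket: $1"; }

# Guarded: inside if/elif the non-zero status only selects the next branch.
label_guarded() {
    local id=$1
    if label_on_issues_from_issue_tracker "$id"; then
        echo "labeled from issue tracker: $id"
    elif label_on_issues_without_tickets "$id"; then
        :
    fi
}

# Unguarded: with errexit active, the bare call ends the shell right here.
label_unguarded() {
    local id=$1
    label_on_issues_from_issue_tracker "$id"
    label_on_issues_without_tickets "$id"
}

# Run each variant in its own subshell with errexit switched on, so the
# demonstration itself keeps running and can report both results.
guarded_out=$(set -e; label_guarded 4425222; echo "reached end")
guarded_rc=$?
unguarded_out=$(set -e; label_unguarded 4425222; echo "reached end")
unguarded_rc=$?

echo "guarded:   rc=$guarded_rc output=[$guarded_out]"
echo "unguarded: rc=$unguarded_rc output=[$unguarded_out]"
```

The guarded variant reaches the fallback and finishes with status 0, while the unguarded one stops with status 1 before the fallback runs. Appending something like `|| true` to the bare call, or moving it into an if, would be the usual ways to keep the script going.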
Updated by ybonatakis 3 months ago · Edited
As tina had mentioned in my last related PR, and as I mentioned in https://github.com/os-autoinst/scripts/pull/342, the function needs to exit.
Updated by livdywan 3 months ago
ybonatakis wrote in #note-10:
As tina had mentioned in my last related PR, and as I mentioned in https://github.com/os-autoinst/scripts/pull/342, the function needs to exit.
Alternative approach https://github.com/os-autoinst/scripts/pull/343
jbaier_cz wrote in #note-14:
Btw. as this ticket is not estimated and has no acceptance criteria; can I misuse that for demanding more test coverage? :)
As discussed in the unblock we agree that it'd be best to have tests first and then see which approach works and Yannis is already looking into that.
Updated by ybonatakis 3 months ago
- Subject changed from [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 to [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M
- Description updated (diff)
Updated by ybonatakis 3 months ago
Tests added and CI passes. Waiting for a new round of feedback
Updated by tinita 3 months ago · Edited
I'm trying to write down the logic I think we want:
investigate_issue
We write autoinst-log.txt and reason into the same file.
If we successfully got autoinst-log.txt (http 200 or 301), we continue with trying to label the test. DONE.
No autoinst-log.txt
- If we couldn't fetch autoinst-log.txt, check if there is a general issue. handle_unreachable performs various tests and should return non-zero to indicate that we shouldn't go on trying to label the test.
- If the http status was not 404, don't continue with labeling.
- If the job is too old, don't continue with labeling.
- If there is no reason as well, don't continue with labeling.
Only if the http status is 404, the job is not too old and the reason is set, continue with labeling.
Updated by livdywan 3 months ago · Edited
- Subject changed from [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M to [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Let's re-estimate this at the next opportunity. I'd say @ybonatakis fulfilled AC2 here but discussing this we got confused several times on how the code works and what the outcome should be. At this point I would be inclined to block it on a ticket to rewrite in Perl or Python.
Updated by ybonatakis 3 months ago
For now the tests fail in CI but work locally. The tests touch a part which wasn't covered before. I will try to fix it.
https://github.com/os-autoinst/scripts/actions/runs/10724123810/job/29739041706?pr=342: the error "openqa-label-known-issues: line 63: hxselect: command not found" is not from the test.
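A "command not found" failure like this can be made visible before a test run with a small pre-flight check. The sketch below is only an assumption about how such a check could look: missing_commands is a hypothetical helper, and the command list merely uses the two tools named in this ticket (hxselect and openqa-cli).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every given command that
# cannot be found on PATH, one per line.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" > /dev/null 2>&1 || echo "$cmd"
    done
}

# Tools mentioned in this ticket; the real script may need more.
missing=$(missing_commands hxselect openqa-cli)
if [[ -n $missing ]]; then
    # Unquoted on purpose so the names end up on one line.
    echo "Missing required commands:" $missing >&2
fi
```

Running this at the start of a CI job would report a missing tool by name instead of surfacing it as an error in the middle of the script.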
Updated by livdywan 2 months ago
https://github.com/os-autoinst/scripts/pull/342 under final review. I also went through the various unresolved comments to highlight open questions and resolve all addressed suggestions.
Updated by livdywan 2 months ago
- Related to action #166649: Rewrite openqa-label-known-issues in Python or another better maintainable language added
Updated by livdywan 2 months ago
- Status changed from Feedback to Resolved
livdywan wrote in #note-23:
https://github.com/os-autoinst/scripts/pull/342 under final review. I also went through the various unresolved comments to highlight open questions and resolve all addressed suggestions.
't is done
Updated by ybonatakis 2 months ago
https://progress.opensuse.org/issues/166772 covers the concern raised in the discussion at https://github.com/os-autoinst/scripts/pull/342/files#r1745228802
Updated by ybonatakis 2 months ago
- Related to action #166772: openqa-label-known-issues overrides size:S added