action #165716

closed

coordination #102915: [saga][epic] Automated classification of failures

coordination #166655: [epic] openqa-label-known-issues

[o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M

Added by tinita about 2 months ago. Updated 28 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-08-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

We got an alert for o3:
opensuse.org :: openqa.opensuse.org :: hook failed - see openqa-gru service logs for details
WARNINGs: rc_failed_per_5min is 8.00 (outside range [:5]).

Here are the problematic lines in the journal:

sudo journalctl -u openqa-gru --since '2024-08-23'
Aug 23 00:03:23 ariel systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Aug 23 00:03:25 ariel systemd[1]: openqa-gru.service: Deactivated successfully.
Aug 23 00:03:25 ariel systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Aug 23 00:03:25 ariel systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
Aug 23 08:03:03 ariel openqa-gru[18277]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:03:12 ariel openqa-gru[18715]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:03:26 ariel openqa-gru[19152]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:03:44 ariel openqa-gru[19454]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:06 ariel openqa-gru[19770]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:22 ariel openqa-gru[20283]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:35 ariel openqa-gru[20569]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:04:40 ariel openqa-gru[20836]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:05:43 ariel openqa-gru[22016]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Aug 23 08:06:45 ariel openqa-gru[24067]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68

I can find the corresponding minion entries. They have a hook_rc of 1, but unfortunately no useful output.
https://openqa.opensuse.org/minion/jobs?id=4223339
https://openqa.opensuse.org/minion/jobs?id=4223321
https://openqa.opensuse.org/minion/jobs?id=4223308

We also have a few of those errors on osd.

The first error I can find on o3 is from August 18:

Aug 18 01:30:04 ariel openqa-gru[31244]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68

For osd it's August 16:

Aug 16 11:57:21 openqa openqa-gru[7266]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
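
The first occurrence can be found by filtering the journal for the error message. A minimal sketch, assuming GNU or BSD grep (for -m1); first_hook_error is a hypothetical helper that reads journal lines from standard input and is not part of os-autoinst-scripts:

```shell
#!/bin/bash
# Hypothetical helper: print the first journal line that reports an error
# from openqa-label-known-issues. Reads journal lines from stdin and
# returns non-zero if no such line exists.
first_hook_error() {
    grep -m1 -F 'openqa-label-known-issues: ERROR'
}
```

It could be used as `sudo journalctl -u openqa-gru --since '2024-08-01' | first_hook_error`, where the start date is only an example value.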

More detail

investigate_issue

We write autoinst-log.txt and reason into the same file.
If we successfully got autoinst-log.txt (http 200 or 301), we continue with trying to label the test. DONE.

No autoinst-log.txt

  • If we couldn't fetch autoinst-log.txt, check if there is a general issue. handle_unreachable performs various tests and should return non-zero to indicate that we shouldn't go on trying to label the test.
  • If the http status was not 404, don't continue with labeling.
  • If the job is too old, don't continue with labeling.
  • If there is no reason as well, don't continue with labeling.

Only if the http status is 404, the job is not too old and the reason is set, continue with labeling.

Acceptance Criteria

  • AC1: Hook script does not abort when label_on_issues_from_issue_tracker returns non-zero
  • AC2: The relevant part of the script is tested
  • AC3: Behaviour from before the previous ticket/PR is reinstated

Related issues 3 (2 open, 1 closed)

Related to openQA Project - action #164296: openqa-label-known-issues does not look at known issues if autoinst-log.txt does not exist but reason could be looked at size:S (Resolved, ybonatakis)
Related to openQA Project - action #166649: Rewrite openqa-label-known-issues in Python or another better maintainable language (New)
Related to openQA Project - action #166772: openqa-label-known-issues overrides (New, 2024-09-13)

Actions #1

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #2

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #3

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #4

Updated by tinita about 2 months ago

The dates point to https://github.com/os-autoinst/scripts/pull/335 as the culprit, but it's hard to tell what the problem is, as line 68 is really just a function header.

Actions #5

Updated by tinita about 2 months ago

  • Related to action #164296: openqa-label-known-issues does not look at known issues if autoinst-log.txt does not exist but reason could be looked at size:S added
Actions #6

Updated by ybonatakis about 2 months ago

  • Assignee set to ybonatakis
Actions #7

Updated by tinita about 2 months ago

It's reproducible with

./openqa-label-known-issues https://openqa.opensuse.org/tests/4425222
Actions #8

Updated by ybonatakis about 2 months ago

tinita wrote in #note-7:

It's reproducible with

./openqa-label-known-issues https://openqa.opensuse.org/tests/4425222

Calling the script with the old code still fails, but silently:

❯ dry_run=1 ./openqa-label-known-issues https://openqa.opensuse.org/tests/4425222
Requesting jobs/4425222 via openqa-cli
❯ echo $?
127

What is "special" about https://openqa.opensuse.org/tests/4425222 is that the CASEDIR uses an absolute tree branch, which I think openQA doesn't support, as it uses the #mybranch shortcut.

Actions #9

Updated by tinita about 2 months ago

In the existing (previous) code we have elif label_on_issues_from_issue_tracker "$id"; then
The function label_on_issues_from_issue_tracker is currently expected to fail sometimes, because it just calls label-on-issue, which passes if it wrote a comment and fails if it didn't. That failure doesn't indicate anything fatal; in that case the script tries the next elif branch.
Because the call sits in the elif ... then condition, the error is caught.

But the newly added line above calls label_on_issues_from_issue_tracker without an if/elif or a trailing || something, so that's why it's aborting the script.

I guess if label_on_issues_from_issue_tracker fails in the new code, we still want to go through the rest of the code, e.g. call label_on_issues_without_tickets or handle_unreviewed.
So the code needs to be rearranged a bit.
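
The difference between the two call styles can be reproduced in isolation. A minimal sketch, assuming the script runs with errexit (set -e); may_fail, run_bare and run_guarded are hypothetical stand-ins and not functions from openqa-label-known-issues. Each case runs in its own bash process so that the caller's context cannot mask the errexit behaviour:

```shell
#!/bin/bash
# Hypothetical stand-in for label_on_issues_from_issue_tracker: returns
# non-zero when no comment was written, which is not a fatal condition.
stub='may_fail() { return 1; }'

# Bare call under errexit: the child shell exits right after may_fail,
# so "reached" is never printed.
run_bare() {
    bash -c "set -e; $stub; may_fail; echo reached"
}

# Same call used as an if condition: errexit does not trigger, so the
# fallback branch still runs.
run_guarded() {
    bash -c "set -e; $stub; if may_fail; then echo labeled; else echo fallback; fi"
}
```

Calling run_bare prints nothing and returns a non-zero status, while run_guarded prints "fallback" and returns zero. Appending `|| true` to the bare call would also keep the script alive, at the cost of hiding the return code.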

Actions #10

Updated by ybonatakis about 2 months ago · Edited

As tina mentioned in my last related PR, and as I mentioned in https://github.com/os-autoinst/scripts/pull/342, the function needs to exit.

Actions #11

Updated by ybonatakis about 2 months ago

  • Status changed from New to Feedback
Actions #12

Updated by ybonatakis about 2 months ago

  • Status changed from Feedback to In Progress
Actions #13

Updated by ybonatakis about 2 months ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by jbaier_cz about 2 months ago

Btw., as this ticket is not estimated and has no acceptance criteria, can I misuse that to demand more test coverage? :)

Actions #15

Updated by livdywan about 1 month ago

ybonatakis wrote in #note-10:

As tina mentioned in my last related PR, and as I mentioned in https://github.com/os-autoinst/scripts/pull/342, the function needs to exit.

Alternative approach https://github.com/os-autoinst/scripts/pull/343

jbaier_cz wrote in #note-14:

Btw., as this ticket is not estimated and has no acceptance criteria, can I misuse that to demand more test coverage? :)

As discussed in the unblock, we agree that it'd be best to have tests first and then see which approach works. Yannis is already looking into that.

Actions #16

Updated by ybonatakis about 1 month ago

  • Subject changed from [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 to [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M
  • Description updated (diff)
Actions #17

Updated by ybonatakis about 1 month ago

Tests added and CI passes. Waiting for a new round of feedback

Actions #18

Updated by livdywan about 1 month ago

ybonatakis wrote in #note-17:

Tests added and CI passes. Waiting for a new round of feedback

https://github.com/os-autoinst/scripts/pull/342#issuecomment-2325027337

Actions #19

Updated by tinita about 1 month ago · Edited

I'm trying to write down the logic I think we want:

investigate_issue

We write autoinst-log.txt and reason into the same file.
If we successfully got autoinst-log.txt (http 200 or 301), we continue with trying to label the test. DONE.

No autoinst-log.txt

  • If we couldn't fetch autoinst-log.txt, check if there is a general issue. handle_unreachable performs various tests and should return non-zero to indicate that we shouldn't go on trying to label the test.
  • If the http status was not 404, don't continue with labeling.
  • If the job is too old, don't continue with labeling.
  • If there is no reason as well, don't continue with labeling.

Only if the http status is 404, the job is not too old and the reason is set, continue with labeling.
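
The rules above can be written down as a single predicate. A minimal sketch under stated assumptions: should_label and its three parameters are hypothetical and do not mirror the real structure of investigate_issue, and the handle_unreachable check is left out because its individual tests are not described here:

```shell
#!/bin/bash
# Hypothetical helper, not part of openqa-label-known-issues.
# Arguments:
#   $1  HTTP status of the autoinst-log.txt request
#   $2  "1" if the job is too old, "0" otherwise
#   $3  the job's reason (may be empty)
# Returns 0 if labeling should continue, 1 otherwise.
should_label() {
    local status=$1 too_old=$2 reason=$3
    case "$status" in
        200|301) return 0 ;;           # log was fetched: go on with labeling
    esac
    [ "$status" = 404 ] || return 1    # any other status: stop
    [ "$too_old" = 0 ] || return 1     # job too old: stop
    [ -n "$reason" ] || return 1       # neither log nor reason: stop
    return 0
}
```

For example, `should_label 404 0 "some reason"` succeeds, while `should_label 404 0 ""` and `should_label 500 0 "some reason"` both fail.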

Actions #20

Updated by livdywan about 1 month ago · Edited

  • Subject changed from [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M to [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68

Let's re-estimate this at the next opportunity. I'd say @ybonatakis fulfilled AC2 here, but while discussing this we got confused several times about how the code works and what the outcome should be. At this point I would be inclined to block it on a ticket to rewrite the script in Perl or Python.

Actions #21

Updated by livdywan about 1 month ago

  • Subject changed from [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 to [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M
  • Description updated (diff)
Actions #22

Updated by ybonatakis about 1 month ago

For now the tests fail in CI, but they pass locally. The tests touch a part that wasn't covered before. I will try to fix this.

In https://github.com/os-autoinst/scripts/actions/runs/10724123810/job/29739041706?pr=342 the error "openqa-label-known-issues: line 63: hxselect: command not found" is not from the test.
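
The "command not found" message suggests that the CI environment lacks a tool the script calls. A minimal sketch of checking for required commands up front; missing_commands is a hypothetical helper, and the full list of tools the script needs is an assumption that would have to be confirmed against the script itself:

```shell
#!/bin/bash
# Hypothetical helper: print each given command name that is not available
# in PATH; return non-zero if at least one is missing.
missing_commands() {
    local cmd rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}
```

A test setup could call `missing_commands hxselect` before running the script and skip or fail early with a clear message when the tool is absent.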

Actions #23

Updated by livdywan 30 days ago

https://github.com/os-autoinst/scripts/pull/342 under final review. I also went through the various unresolved comments to highlight open questions and resolve all addressed suggestions.

Actions #24

Updated by livdywan 30 days ago

  • Related to action #166649: Rewrite openqa-label-known-issues in Python or another better maintainable language added
Actions #25

Updated by okurz 30 days ago

  • Parent task set to #166655
Actions #26

Updated by ybonatakis 29 days ago

CI passes

Actions #27

Updated by livdywan 28 days ago

  • Status changed from Feedback to Resolved

livdywan wrote in #note-23:

https://github.com/os-autoinst/scripts/pull/342 under final review. I also went through the various unresolved comments to highlight open questions and resolve all addressed suggestions.

't is done

Actions #29

Updated by ybonatakis 28 days ago

  • Related to action #166772: openqa-label-known-issues overrides added