action #124212

closed

Unreviewed issue for "obvious" needle mismatch without any indication what unknown error was found size:M

Added by livdywan almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-02-08
Due date:
% Done: 0%
Estimated time:
Description

Observation

openqa_install+publish jobs in the openQA-in-openQA tests started failing, which caused a wave of "Unreviewed issue (Group 24 openQA)" emails to be sent every half hour or so.

The emails contain this:

# --- 8< ---
# [2023-02-09T09:03:50.690315+01:00] [debug] [pid:21411] QEMU status is not 'shutdown', it is 'running'
# [2023-02-09T09:03:50.690400+01:00] [debug] [pid:21268] backend shutdown state: 
# [2023-02-09T09:03:50.690645+01:00] [info] [pid:21411] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
# [2023-02-09T09:03:51.741976+01:00] [debug] [pid:21411] Passing remaining frames to the video encoder
# frame056 fps=0 q=0 Lsize!41kB time:02:07.29 bitrate7.8kbits/s speed=083x    
# video:2122kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.936463%
# [2023-02-09T09:03:54.217738+01:00] [debug] [pid:21411] Waiting for video encoder to finalize the video
# [2023-02-09T09:03:54.217801+01:00] [debug] [pid:21411] The external video encoder (pid 21537) terminated
# [2023-02-09T09:03:54.217841+01:00] [debug] [pid:21411] The built-in video encoder (pid 21538) terminated
# [2023-02-09T09:03:54.218255+01:00] [debug] [pid:21411] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 21411 (/usr/bin/isotovideo: backen)
# --- >8 ---

The SIGTERM is expected here. More relevant messages can also be found in the log:

[2023-02-09T10:36:43.583422+01:00] [debug] [pid:12662] no match: -0.9s, best candidate: openqa-dashboard-no_jobs-tumbleweed-20200106 (0.00)
[2023-02-09T10:36:43.892722+01:00] [debug] [pid:12502] >>> testapi::_check_backend_response: match=openqa-dashboard timed out after 360 (assert_screen)
[2023-02-09T10:36:43.986352+01:00] [info] [pid:12502] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'openqa-dashboard' matched

Acceptance criteria

  • AC1: There is always a helpful hint explaining why the email was sent
  • AC2: Jobs are not built so frequently that they cause email floods

Out of scope

  • A failure in the same job doesn't cause repeated emails

Suggestions

  • Verify that a comment like poo#124143 prevents an "unreviewed issue" from being detected
  • Investigate what error message if any triggered the script
  • Add a note on why the job is "unreviewed", i.e. unreviewed always means no bug ref
  • Always include no candidate needle with tag(s) messages in the email
  • Consider explicitly treating needle mismatches as "reviewed"
  • There's "[2023-02-09T10:37:10.650093+01:00] [warn] [pid:12502] !!! testapi::script_run: DEPRECATED call of script_run() in lib/openQAcoretest.pm:8 add die_on_timeout => ? to the call or set $distri->{script_run_die_on_timeout} to avoid this warning", which could be seen as an unreviewed error message, but it's not seen in the snippet
  • Check if this could be an unintended side-effect of #98862
  • Check the openQA-in-openQA trigger-test-monitor pipeline on jenkins.qa.suse.de/, maybe we trigger too many
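The suggestion to always include "no candidate needle with tag(s)" messages in the email could be prototyped roughly as follows. This is only a sketch in Python, not the actual hook script: the function name needle_mismatch_tags is hypothetical, and the regex is modelled solely on the log line quoted in the observation, so other message variants may need different handling.

```python
import re

# Pattern modelled on the log line quoted in this ticket; whether other
# os-autoinst versions word the message identically is an assumption.
NEEDLE_MISMATCH = re.compile(r"no candidate needle with tag\(s\) '([^']+)' matched")


def needle_mismatch_tags(log_text):
    """Return the tag strings of all needle mismatch messages, in order, without duplicates."""
    seen = []
    for match in NEEDLE_MISMATCH.finditer(log_text):
        tags = match.group(1)
        if tags not in seen:
            seen.append(tags)
    return seen


# Two of the log lines quoted in the observation above.
log = (
    "[2023-02-09T10:36:43.892722+01:00] [debug] [pid:12502] >>> testapi::_check_backend_response: "
    "match=openqa-dashboard timed out after 360 (assert_screen)\n"
    "[2023-02-09T10:36:43.986352+01:00] [info] [pid:12502] ::: basetest::runtest: "
    "# Test died: no candidate needle with tag(s) 'openqa-dashboard' matched\n"
)
print(needle_mismatch_tags(log))  # ['openqa-dashboard']
```

The returned tags could then be appended to the notification text so the reader immediately sees that the failure is a needle mismatch.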

Files


Related issues 2 (0 open, 2 closed)

Copied from openQA Project (public) - action #124143: openqa-in-openqa test fails because text color changed - missing CSS? size:M (Resolved, mkittler, 2023-02-08)
Copied to openQA Project (public) - action #124694: Redundant email about new comment in OBS (Resolved, mkittler, 2023-02-08)

Actions #1

Updated by livdywan almost 2 years ago

  • Copied from action #124143: openqa-in-openqa test fails because text color changed - missing CSS? size:M added
Actions #2

Updated by tinita almost 2 years ago

All the "Unreviewed issue" notifications I can find link to jobs where there was no bugref carryover yet.
If a carryover happens then the hook script is not called, so no email should be sent.

The first bugref for https://openqa.opensuse.org/tests/3105735#next_previous openqa-Tumbleweed-dev-x86_64-Build:TW.17520-openqa_install+publish@64bit-2G was added a day later https://openqa.opensuse.org/tests/3106678#comments
The first bugref for https://openqa.opensuse.org/tests/3105702#next_previous openqa-Tumbleweed-dev-x86_64-Build:TW.17519-openqa_from_git@64bit-2G was added directly after that job https://openqa.opensuse.org/tests/3105734#comments

So for openqa_install+publish we still got emails until the next morning, when the bugref was added.
Probably it was just confusing because we have different test scenarios with the same failure, and the bugref was at first only added to one of the scenarios.

Or do you have a specific notification that you think shouldn't have been sent?

Actions #3

Updated by tinita almost 2 years ago

Consider explicitly treating needle mismatches as "reviewed"

Why should a needle mismatch be treated as reviewed?

Actions #4

Updated by mkittler almost 2 years ago

When I saw the mail flood I initially confused it with messages from logwarn. So I was wondering which specific place in the openQA job logs these mails were generated for, and why a mail is generated at all considering there's nothing wrong with os-autoinst or openQA themselves. But apparently it is really just about the job having no bugref and there being no ticket with a matching auto_review regex.
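To make that condition concrete, here is a minimal sketch of the decision as described in this comment. It is not the real hook script: the function name is_unreviewed, the simplified bugref regex and the idea of passing the auto_review regexes in as a list are all assumptions for illustration.

```python
import re

# Simplified bug reference pattern such as "poo#124143"; the set of trackers
# openQA actually accepts is larger, so this regex is an assumption.
BUGREF = re.compile(r"\b[a-z]+#\d+\b")


def is_unreviewed(comments, log_text, auto_review_patterns):
    """Return True if no comment carries a bugref and no auto_review regex matches the log."""
    if any(BUGREF.search(comment) for comment in comments):
        return False
    if any(re.search(pattern, log_text) for pattern in auto_review_patterns):
        return False
    return True


log = "Test died: no candidate needle with tag(s) 'openqa-dashboard' matched"
print(is_unreviewed([], log, []))                        # True
print(is_unreviewed(["poo#124143"], log, []))            # False
print(is_unreviewed([], log, [r"no candidate needle"]))  # False
```

Under this reading a plain needle mismatch is reported as unreviewed simply because nobody has linked a ticket yet, not because an unknown error was found in the log.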

Nevertheless it is bad to get a flood of mails for one issue (until someone is able to create at least a ticket about it).

Actions #5

Updated by livdywan almost 2 years ago

  • Subject changed from Unreviewed issue for "obvious" needle mismatch without any indication what unknown error was found to Unreviewed issue for "obvious" needle mismatch without any indication what unknown error was found size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan almost 2 years ago

  • Copied to action #124694: Redundant email about new comment in OBS added
Actions #7

Updated by mkittler almost 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

I've set the frequency of http://jenkins.qa.suse.de/job/trigger-openQA_in_openQA-TW/configure to hourly (H/15 … -> H …). I think this should cover AC2. At least the flood of mails will be reduced by a factor of 4. We can always adjust this on Jenkins level.
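The factor of 4 follows from simple arithmetic, assuming H/15 sits in the minute field of the schedule (one trigger every 15 minutes) and H means one trigger per hour. A small Python sketch, with triggers_per_day as a hypothetical helper:

```python
def triggers_per_day(interval_minutes):
    """Number of scheduled triggers in 24 hours for a fixed interval in minutes."""
    return (24 * 60) // interval_minutes


before = triggers_per_day(15)  # assumed old schedule: every 15 minutes
after = triggers_per_day(60)   # new schedule: hourly
print(before, after, before // after)  # 96 24 4
```

This is an upper bound on triggered builds; whether each trigger actually leads to a new job and a new email depends on the pipeline.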

Note that Jenkins also polls http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/noarch for changes (content and modification date). However, this is likely pointless because the web server doesn't return a last modified date (see curl -s -v -X HEAD http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/noarch) and the contents contain a csrf-token that is likely to change between requests.
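The header check done with curl above could also be scripted. The following Python sketch uses only the standard library; fetch_head_headers and has_last_modified are hypothetical helper names, and the sample header values are made up for illustration rather than taken from the server.

```python
import urllib.request


def fetch_head_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a plain dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def has_last_modified(headers):
    """Return True if the headers contain a Last-Modified entry (case-insensitive)."""
    return any(name.lower() == "last-modified" for name in headers)


# Offline illustration with made-up header sets.
print(has_last_modified({"Last-Modified": "Thu, 09 Feb 2023 08:00:00 GMT"}))  # True
print(has_last_modified({"Content-Type": "text/html"}))                        # False
```

Calling has_last_modified(fetch_head_headers(url)) with the repository URL would reproduce the check against the live server. If it returns False, polling by modification date cannot detect changes.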

Actions #8

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback

This should improve the wording for AC1: https://github.com/os-autoinst/scripts/pull/218

Actions #9

Updated by tinita almost 2 years ago

I added a screenshot of how the notification looks now.

Actions #10

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved

I think it is clear enough. The reduced triggering frequency is also visibly effective on https://openqa.opensuse.org/group_overview/24. So I'm considering this resolved.
