action #124212
closed
Unreviewed issue for "obvious" needle mismatch without any indication what unknown error was found size:M
Description
Observation
openqa_install+publish jobs in the openQA-in-openQA tests started failing, which caused a wave of "Unreviewed issue (Group 24 openQA)" emails to be sent every half hour or so.
The emails contain this:
# --- 8< ---
# [2023-02-09T09:03:50.690315+01:00] [debug] [pid:21411] QEMU status is not 'shutdown', it is 'running'
# [2023-02-09T09:03:50.690400+01:00] [debug] [pid:21268] backend shutdown state:
# [2023-02-09T09:03:50.690645+01:00] [info] [pid:21411] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
# [2023-02-09T09:03:51.741976+01:00] [debug] [pid:21411] Passing remaining frames to the video encoder
# frame056 fps=0 q=0 Lsize!41kB time:02:07.29 bitrate7.8kbits/s speed=083x
# video:2122kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.936463%
# [2023-02-09T09:03:54.217738+01:00] [debug] [pid:21411] Waiting for video encoder to finalize the video
# [2023-02-09T09:03:54.217801+01:00] [debug] [pid:21411] The external video encoder (pid 21537) terminated
# [2023-02-09T09:03:54.217841+01:00] [debug] [pid:21411] The built-in video encoder (pid 21538) terminated
# [2023-02-09T09:03:54.218255+01:00] [debug] [pid:21411] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 21411 (/usr/bin/isotovideo: backen)
# --- >8 ---
The SIGTERM is expected here. More relevant messages can also be found in the log:
[2023-02-09T10:36:43.583422+01:00] [debug] [pid:12662] no match: -0.9s, best candidate: openqa-dashboard-no_jobs-tumbleweed-20200106 (0.00)
[2023-02-09T10:36:43.892722+01:00] [debug] [pid:12502] >>> testapi::_check_backend_response: match=openqa-dashboard timed out after 360 (assert_screen)
[2023-02-09T10:36:43.986352+01:00] [info] [pid:12502] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'openqa-dashboard' matched
Acceptance criteria
- AC1: There is always a helpful hint explaining why the email was sent
- AC2: Jobs are not built so frequently that they cause email floods
Out of scope
- A failure in the same job doesn't cause repeated emails
Suggestions
- Verify that a comment like poo#124143 prevents an "unreviewed issue" from being detected
- Investigate what error message, if any, triggered the script
- Add a note on why the job is "unreviewed", i.e. unreviewed always means no bug ref
- Always include "no candidate needle with tag(s)" messages in the email
- Consider explicitly treating needle mismatches as "reviewed"
- There's "[2023-02-09T10:37:10.650093+01:00] [warn] [pid:12502] !!! testapi::script_run: DEPRECATED call of script_run() in lib/openQAcoretest.pm:8 add die_on_timeout => ? to the call or set $distri->{script_run_die_on_timeout} to avoid this warning", which could be seen as an unreviewed error message but it's not seen in the snippet
- Check if this could be an unintended side-effect of #98862
- Check the openQA-in-openQA trigger-test-monitor pipeline in jenkins.qa.suse.de/, maybe we trigger too many
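The suggestion to always include the needle mismatch message in the email could be sketched roughly as follows. This is only an illustration and not the actual hook script: the helper name extract_needle_mismatches is hypothetical, the log is assumed to be available as plain text, and the message wording is taken from the log excerpt in the Observation section.

```python
import re

# Wording copied from the log excerpt in this ticket; other os-autoinst
# versions may phrase the message differently (assumption).
NEEDLE_MISMATCH = re.compile(r"no candidate needle with tag\(s\) '([^']+)' matched")


def extract_needle_mismatches(log_text):
    """Return all log lines that report a needle mismatch (hypothetical helper)."""
    return [line for line in log_text.splitlines() if NEEDLE_MISMATCH.search(line)]
```

The returned lines could then be appended to the mail body so the reader sees the actual failure reason instead of only the shutdown messages.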
Updated by livdywan almost 2 years ago
- Copied from action #124143: openqa-in-openqa test fails because text color changed - missing CSS? size:M added
Updated by tinita almost 2 years ago
All the "Unreviewed issue" notifications I can find link to jobs where there was no bugref carryover yet.
If a carryover happens then the hook script is not called, so no email should be sent.
The first bugref for https://openqa.opensuse.org/tests/3105735#next_previous openqa-Tumbleweed-dev-x86_64-Build:TW.17520-openqa_install+publish@64bit-2G was added a day later https://openqa.opensuse.org/tests/3106678#comments
The first bugref for https://openqa.opensuse.org/tests/3105702#next_previous openqa-Tumbleweed-dev-x86_64-Build:TW.17519-openqa_from_git@64bit-2G was added directly after that job https://openqa.opensuse.org/tests/3105734#comments
So for openqa_install+publish we still got emails until the bugref was added the next morning.
It was probably just confusing because we have different test scenarios with the same failure, and the bugref was at first only added to one of the scenarios.
Or do you have a specific notification that you think shouldn't have been sent?
Updated by tinita almost 2 years ago
Consider explicitly treating needle mismatches as "reviewed"
Why should a needle mismatch be treated as reviewed?
Updated by mkittler almost 2 years ago
When I saw the mail flood I initially confused it with messages from logwarn. So I was wondering which specific place in the openQA job logs these mails were generated for, and why a mail is generated at all considering there's nothing wrong with os-autoinst or openQA themselves. But apparently it is really just about the job having neither a bugref nor a ticket with a matching auto_review regex.
Nevertheless it is bad to get a flood of mails for one issue (until someone is able to create at least a ticket about it).
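The condition described above (no bugref and no ticket with a matching auto_review regex) can be sketched like this. It is a simplified illustration, not the logic of the real hook script: the function name is_unreviewed and the bugref pattern are assumptions made for this example.

```python
import re

# Hypothetical bugref pattern: a lowercase tracker prefix, '#', and a number,
# e.g. "poo#124143". The prefixes openQA really accepts are not listed in this ticket.
BUGREF = re.compile(r"\b[a-z]+#\d+\b")


def is_unreviewed(comments, log_text, auto_review_patterns):
    """Return True if no comment carries a bugref and no auto_review regex matches the log."""
    if any(BUGREF.search(comment) for comment in comments):
        return False
    return not any(re.search(pattern, log_text) for pattern in auto_review_patterns)
```

With this reading, a job stays "unreviewed" and keeps producing mails until someone adds a bugref comment or a ticket with a suitable auto_review regex exists.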
Updated by livdywan almost 2 years ago
- Subject changed from Unreviewed issue for "obvious" needle mismatch without any indication what unknown error was found to Unreviewed issue for "obvious" needle mismatch without any indication what unknown error was found size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan almost 2 years ago
- Copied to action #124694: Redundant email about new comment in OBS added
Updated by mkittler almost 2 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
I've set the frequency of http://jenkins.qa.suse.de/job/trigger-openQA_in_openQA-TW/configure to hourly (H/15 … -> H …). I think this should cover AC2. At least the flood of mails will be reduced by a factor of 4. We can always adjust this on Jenkins-level.
Note that Jenkins also polls http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/noarch for changes (content and modification date). However, this is likely pointless because the web server doesn't return a last modified date (see curl -s -v -X HEAD http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/noarch) and the contents contain a csrf-token that is likely to change between requests.
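To illustrate why a changing csrf-token defeats content-based polling, here is a small sketch that hashes a page body with and without the token. It does not reflect how Jenkins compares content internally. The function name content_fingerprint and the Rails-style meta tag markup are assumptions.

```python
import hashlib
import re

# Assumed markup: a Rails-style <meta name="csrf-token" content="..."> tag.
CSRF_META = re.compile(r'<meta\s+name="csrf-token"\s+content="[^"]*"\s*/?>')


def content_fingerprint(html, strip_csrf=False):
    """Hash a page body, optionally after removing the csrf-token meta tag."""
    if strip_csrf:
        html = CSRF_META.sub("", html)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()
```

Two fetches of an otherwise identical page give different fingerprints as long as the token is included, so a naive content comparison would report a change on every poll.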
Updated by mkittler almost 2 years ago
- Status changed from In Progress to Feedback
This should improve the wording for AC1: https://github.com/os-autoinst/scripts/pull/218
Updated by tinita almost 2 years ago
I added a screenshot of how the notification looks now
Updated by mkittler almost 2 years ago
- Status changed from Feedback to Resolved
I think it is clear enough. The reduced triggering frequency is also visibly effective on https://openqa.opensuse.org/group_overview/24. So I'm considering this resolved.