action #155173
closed [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out size:M
Description
Observation
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_nginx@64bit-2G fails in openqa_worker
Reproducible
Fails since (at least) Build :TW.26398 (current job)
Expected result
Last good: :TW.26397 (or more recent)
Suggestions
- Look up older tickets about adding os-autoinst-setup-multi-machine to openQA-in-openQA tests and add them as references
- Try to reproduce and fix, or simply apply a mitigation as applicable, e.g. increase the timeout or add a retry
- The proper place to fix might be in the test code but could also be in os-autoinst-setup-multi-machine itself or at an even lower level
Further details
Always latest result in this scenario: latest
Updated by jbaier_cz about 1 year ago
- Tags set to openqa-in-openqa
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by okurz about 1 year ago
- Subject changed from [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out to [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 1 year ago
- Related to action #155170: [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M added
Updated by okurz about 1 year ago
- Related to action #138302: Ensure automated openQA tests verify that os-autoinst-setup-multi-machine sets up valid networking size:M added
Updated by okurz about 1 year ago
I did not realize that https://openqa.opensuse.org/group_overview/24?limit_builds=50&limit_builds=100&limit_builds=400 looks so bad, bumping prio to "Urgent". I assume this is related to #138302 and possibly missing notifications due to #150956
Updated by okurz about 1 year ago
- Priority changed from Urgent to High
openqa_install_multimachine seems to be more aware, reducing to "High"
Updated by mkittler about 1 year ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 1 year ago
Not sure whether os-autoinst-setup-multi-machine really timed out (it generally seemed to work; I couldn't spot an error message or stuck command in the video). So maybe the problem is that the end marker 2HR0K- doesn't appear in the serial log.
Updated by openqa_review about 1 year ago
- Due date set to 2024-03-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 1 year ago · Edited
I had a look at os-autoinst's code to see how assert_script_run works. Apparently it is using the same serial device that also ends up as the serial0.txt log for the marker. And that file contains the marker:
[ 495.644496][T19426] Failed to associated timeout policy `ovs_test_tp'
[ 496.217552][T19426] tap137: entered promiscuous mode
2HR0K-0-
[ 496.781360][T19426] tap138: entered promiscuous mode
[ 497.539160][T19426] tap139: entered promiscuous mode
There were some other messages but they left the marker in one piece. So probably this is really just a timeout, as the error message suggests.
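To make the check above reproducible, here is a minimal sketch of testing whether an end marker survived intact in a serial log that also carries kernel messages. It assumes the marker is followed by the exit code in the form `MARKER-<code>-`, as the `2HR0K-0-` line in the excerpt suggests; `find_marker` is a hypothetical helper for illustration, not os-autoinst code.

```python
import re

# Excerpt of serial0.txt as quoted in this comment.
SERIAL_LOG = """\
[ 495.644496][T19426] Failed to associated timeout policy `ovs_test_tp'
[ 496.217552][T19426] tap137: entered promiscuous mode
2HR0K-0-
[ 496.781360][T19426] tap138: entered promiscuous mode
[ 497.539160][T19426] tap139: entered promiscuous mode
"""


def find_marker(log, marker):
    """Return the exit code after `marker-` if the marker is intact, else None.

    Assumption: the marker is written as MARKER-<exit code>-.
    """
    match = re.search(re.escape(marker) + r"-(\d+)-", log)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(find_marker(SERIAL_LOG, "2HR0K"))
```

If a kernel message had been written into the middle of the marker, the pattern would no longer match and the helper would return None, which is the case that was ruled out here.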
The last output we see (in the video) from the setup script is the symlinking that corresponds to systemctl enable os-autoinst-openvswitch. So the script is already almost done (and not stuck somewhere in the middle). The only further command is systemctl restart openvswitch os-autoinst-openvswitch and most likely os-autoinst-openvswitch took too long to restart. The restart will really block until the service is ready because it is of type dbus, so systemd will wait until it actually appears on the dbus, and that in turn won't happen until all the initialization logic in os-autoinst-openvswitch has happened. Considering we have already
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=1200"
as an override for this systemd unit in production, it is fair to assume that 90 seconds for the whole os-autoinst-setup-multi-machine is probably not enough. So I'll be increasing the timeout.
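The reasoning above boils down to a blocking command outliving the caller's timeout. A minimal sketch of that situation, with a short-lived child process standing in for the blocking `systemctl restart` (the 1 second delay and the helper name `finishes_within` are assumptions for illustration only, not taken from the test code):

```python
import subprocess
import sys


def finishes_within(command, timeout_s):
    """Run command; return True if it exits before timeout_s elapses."""
    try:
        subprocess.run(command, timeout=timeout_s, check=True)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return False


# Stand-in for a restart that only returns once the service is "ready".
slow_restart = [sys.executable, "-c", "import time; time.sleep(1)"]

if __name__ == "__main__":
    print(finishes_within(slow_restart, 0.2))  # caller gives up too early
    print(finishes_within(slow_restart, 10))   # generous timeout succeeds
```

The command itself is healthy in both calls; only the caller's budget differs, which matches the picture of a setup script that is almost done when the test gives up.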
Updated by mkittler about 1 year ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 year ago
PR for fixing the worker module running into the screen saver: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/166
Updated by okurz about 1 year ago
- Status changed from Feedback to In Progress
I merged related https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/165 . https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/166 CI checks were failing due to the error fixed in https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/165 . I asked mergifyio to rebase. This still needs a second approval.
Updated by mkittler about 1 year ago
- Status changed from In Progress to Resolved
The PRs have been merged. With that we can consider the issue resolved. (I also created https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/167 but that's really not part of this issue.)
Updated by okurz about 1 year ago
- Status changed from Resolved to In Progress
ok, looks promising. But for a sporadic issue we should be strict and look for a statistically sound verification, especially since we currently do not have automatic notifications about failing openQA-in-openQA tests.
Updated by mkittler about 1 year ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 year ago
- Status changed from Feedback to Resolved
I haven't seen jobs failing anymore in openqa_worker (and also not in worker, which is technically a different issue) for the last 3 days (after checking Next & Previous of all relevant scenarios) and lots of jobs have been scheduled since then (over 80 in the scenario mentioned in the PR description alone).
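As a rough way to quantify what a run of passing jobs tells us about a sporadic failure, here is a sketch that computes an upper bound on the remaining per-job failure rate. It assumes the jobs are independent and uses 80 passes with no failures as a stand-in for the "over 80" figure; `failure_rate_upper_bound` is a hypothetical helper, not part of any openQA tooling.

```python
def failure_rate_upper_bound(passes, confidence=0.95):
    """Upper bound on the per-job failure probability after `passes`
    consecutive passes and zero failures.

    Solves (1 - p) ** passes = 1 - confidence for p.
    Assumption: job results are independent of each other.
    """
    if passes <= 0:
        raise ValueError("passes must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / passes)


if __name__ == "__main__":
    bound = failure_rate_upper_bound(80)
    print(f"{bound:.1%}")  # prints 3.7%
```

Under these assumptions, 80 clean runs are consistent with a remaining failure rate of up to roughly 3.7% at 95% confidence, so a longer observation window would be needed to rule out a rarer failure.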