action #155173
closed
[openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out size:M
Added by tinita 11 months ago.
Updated 10 months ago.
Category:
Regressions/Crashes
Description
Observation
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_nginx@64bit-2G fails in openqa_worker
Reproducible
Fails since (at least) Build :TW.26398 (current job)
Expected result
Last good: :TW.26397 (or more recent)
Suggestions
- Look up older tickets about adding os-autoinst-setup-multi-machine to openQA-in-openQA tests and add them as references
- Try to reproduce and fix, or simply apply a mitigation as applicable, e.g. increase the timeout or add a retry (see the sketch after this list)
- The proper place for a fix might be the test code, but it could also be os-autoinst-setup-multi-machine itself or an even lower level
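A minimal sketch of the retry mitigation in the openQA test code, assuming the script is invoked through the os-autoinst testapi (the retry, delay and timeout values here are made up for illustration, not taken from the actual fix):

use testapi;

# retry the setup script a few times before failing the test,
# with a generous per-attempt timeout
script_retry('os-autoinst-setup-multi-machine', retry => 3, delay => 30, timeout => 300);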
Further details
Always latest result in this scenario: latest
- Tags set to openqa-in-openqa
- Category set to Regressions/Crashes
- Target version set to Ready
- Subject changed from [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out to [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out size:M
- Description updated (diff)
- Status changed from New to Workable
- Related to action #155170: [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M added
- Priority changed from Normal to Urgent
- Related to action #138302: Ensure automated openQA tests verify that os-autoinst-setup-multi-machine sets up valid networking size:M added
- Priority changed from Urgent to High
openqa_install_multimachine seems to be less affected, reducing to "High"
- Status changed from Workable to In Progress
- Assignee set to mkittler
Not sure whether os-autoinst-setup-multi-machine really timed out (it generally seemed to work; I couldn't spot an error message or a stuck command in the video). So maybe the problem is that the end marker 2HR0K- doesn't appear in the serial log.
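For context, a rough sketch of how that marker mechanism works, heavily simplified from the real os-autoinst code (the actual marker string is randomly generated per command; "2HR0K" is just the one visible in this job's log):

# assert_script_run types the command so that the marker plus the exit
# code is echoed to the serial device once the command finishes:
type_string "(os-autoinst-setup-multi-machine; echo 2HR0K-\$?-) | tee /dev/$serialdev\n";
# ... then it waits for "<marker>-0-" on the serial device; if the marker
# does not show up within the timeout, the test dies with "timed out":
die 'os-autoinst-setup-multi-machine timed out' unless wait_serial('2HR0K-0-', timeout => 90);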
- Due date set to 2024-03-01
Setting due date based on mean cycle time of SUSE QE Tools
I had a look at os-autoinst's code to see how assert_script_run works. Apparently it is using the same serial device that also ends up as the serial0.txt log for the marker. And that file contains the marker:
[ 495.644496][T19426] Failed to associated timeout policy `ovs_test_tp'
[ 496.217552][T19426] tap137: entered promiscuous mode
2HR0K-0-
[ 496.781360][T19426] tap138: entered promiscuous mode
[ 497.539160][T19426] tap139: entered promiscuous mode
There were some other messages but the marker was left in one piece. So this is probably really just a timeout, as the error message suggests.
The last output we see (in the video) from the setup script is the symlinking that corresponds to systemctl enable os-autoinst-openvswitch. So the script is already almost done (and not stuck somewhere in the middle). The only further command is systemctl restart openvswitch os-autoinst-openvswitch, and most likely os-autoinst-openvswitch took too long to restart. The restart really blocks until the service is ready because the unit is of Type=dbus, so systemd waits until the service actually appears on D-Bus, and that in turn won't happen until all the initialization logic in os-autoinst-openvswitch has run. Considering we already have
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=1200"
as an override for this systemd unit in production, it is fair to assume that 90 seconds for the whole os-autoinst-setup-multi-machine run is probably not enough. So I'll be increasing the timeout.
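A minimal sketch of what that could look like on the test side (the 600 s value is an assumption for illustration, not necessarily the value actually chosen):

# allow substantially more than the 90 s default for the whole setup,
# since restarting os-autoinst-openvswitch alone can take a long time
assert_script_run('os-autoinst-setup-multi-machine', timeout => 600);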
- Status changed from In Progress to Feedback
- Status changed from Feedback to In Progress
- Status changed from In Progress to Resolved
- Status changed from Resolved to In Progress
OK, this looks promising. But for a sporadic issue we should be strict and look for a statistically sound verification, especially since we currently do not have automatic notifications about failing openQA-in-openQA tests.
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
I haven't seen jobs failing anymore in openqa_worker (and also not in worker, which is technically a different issue) for the last 3 days (after checking Next & Previous of all relevant scenarios), and lots of jobs have been scheduled since then (over 80 alone in the scenario mentioned in the PR description).