action #109028: [openqa][worker][sut] Very severe stability and connectivity issues of openqa workers and suts - openQA Infrastructure (public) - openSUSE Project Management Tool

action #109028

closed

[openqa][worker][sut] Very severe stability and connectivity issues of openqa workers and suts

Added by waynechen55 about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: okurz
Category:
Target version: openQA Project (public) - Ready
Start date: 2022-03-28
Due date:
% Done:
Estimated time:
Description

Observation

The openQA environment for virtualization test runs has been very unstable recently, to the point of being intolerable. Under current conditions, virtualization functional test runs cannot finish completely or on time, which is a problem for any testing task with a deadline.

The last daily build, Build116.4, has not finished its acceptance test run despite many reruns.
For any unfinished test suite on the Build116.4 acceptance test run page, there are pages of rerun history, for example, this one.
I believe there is a general environment issue, but from what I have observed, openqaworker-2:18 and openqaworker-2:20 are the two most affected. They can hardly finish any test run assigned to them, even with reruns.

Steps to reproduce

Observe a newly triggered test run for a new daily build
Manually rerun all failed test runs repeatedly

Problem

Only a general environment issue seems likely to have such a widespread and severe impact on openQA test runs.
I am also aware of poo#108845, but I am not sure whether it is related.

Suggestion

Check for environment issues in the openQA network, including SUT machine status, the server room situation, infrastructure connectivity and stability, network glitches, etc.
Check the status of the openQA workers and of the machines they run on, especially openqaworker-2:18 and openqaworker-2:20. They may need maintenance.

Workaround

Even rerunning jobs does not improve the current situation.

Related issues 1 (0 open — 1 closed)


Updated by waynechen55 about 3 years ago

This is related to https://progress.opensuse.org/issues/108764

From what I have seen so far, all failures are related to:
boot_from_pxe
scc_registration
Jobs ending as "incomplete" (the worker repeatedly cannot establish an IPMI connection to the SUT)


Updated by okurz about 3 years ago

Related to action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M added
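
As a rough illustration of what the auto_review pattern in the related ticket title matches, here is a minimal Python sketch. The regular expression is copied from the title above (the part between `auto_review:"` and `":retry`). The function name `is_known_network_issue`, the host name `sut-example.qa.suse.de` and the sample log lines are hypothetical and only serve the example. It is also an assumption that the pattern is applied as a plain search over a job's failure text.

```python
import re

# Pattern copied from the auto_review field of related ticket #108845.
AUTO_REVIEW = re.compile(
    r"(Error connecting to VNC server.*qa.suse.*Connection timed out"
    r"|ipmitool.*qa.suse.*Unable to establish)"
)


def is_known_network_issue(text: str) -> bool:
    """Return True if the text matches the auto_review pattern (hypothetical helper)."""
    return AUTO_REVIEW.search(text) is not None


# Hypothetical sample lines, not taken from real job logs.
samples = [
    "ipmitool -I lanplus -H sut-example.qa.suse.de chassis power status: "
    "Unable to establish IPMI v2 / RMCP+ session",
    "Error connecting to VNC server <sut-example.qa.suse.de:5901>: Connection timed out",
    "Test died: needle mismatch in scc_registration",
]

for line in samples:
    label = "known network issue" if is_known_network_issue(line) else "needs manual review"
    print(f"{label}: {line}")
```

In this sketch the first two lines would be labeled with the ticket and, because of the `:retry` suffix, be candidates for an automatic retrigger, while the third would still need manual review.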


Updated by okurz about 3 years ago

Target version set to Ready

Please add more details from the job details to the ticket description. I linked a related ticket. This is likely due to a network problem in the QA network. I recommend that you use https://github.com/os-autoinst/scripts#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger to detect such issues and retry if you see that this helps.


Updated by okurz about 3 years ago

Priority changed from High to Urgent


Updated by okurz about 3 years ago

Status changed from New to Blocked
Assignee set to okurz

This may well just be related to #108845 and the QA (related) switches. Due to recent developments, e.g. in #108845#note-21, I will take this ticket and track it.


Updated by okurz about 3 years ago

Due date set to 2022-04-01


Updated by okurz about 3 years ago

Due date changed from 2022-04-01 to 2022-03-31


Updated by okurz about 3 years ago

Due date deleted (~~2022-03-31~~)


Updated by okurz about 3 years ago

Status changed from Blocked to Resolved

After #108845 got resolved, I checked the results and found that https://openqa.suse.de/group_overview/263 looks much better now. Also, the job histories of the mentioned workers, https://openqa.suse.de/admin/workers/1250 and https://openqa.suse.de/admin/workers/376, look mostly green, so I assume that the underlying problem was actually the same, within the EngInfra maintained network infrastructure.


Updated by waynechen55 about 3 years ago

okurz wrote:

After #108845 got resolved, I checked the results and found that https://openqa.suse.de/group_overview/263 looks much better now. Also, the job histories of the mentioned workers, https://openqa.suse.de/admin/workers/1250 and https://openqa.suse.de/admin/workers/376, look mostly green, so I assume that the underlying problem was actually the same, within the EngInfra maintained network infrastructure.

Thanks for your great help.
