action #109028

[openqa][worker][sut] Very severe stability and connectivity issues of openqa workers and suts

Added by waynechen55 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-03-28
Due date:
% Done: 0%
Estimated time:
Description

Observation

The openQA environment for virtualization test runs has been in very poor shape recently, to the point that I consider it intolerable. Under the current circumstances, virtualization functional test runs cannot finish completely or on time if any testing task with time constraints comes in.

Steps to reproduce

  • Observe a newly triggered test run with a new daily build
  • Manually rerun all failed test runs repeatedly

Problem

  • It seems that only a general environment issue could have such a widespread and severe impact on openQA test runs
  • I am also aware of poo#108845. Not sure whether it is relevant.

Suggestion

  • Check for environment issues in the openQA network, including SUT machine status, the server room situation, infrastructure connectivity and stability, network glitches, etc.
  • Check the openQA worker status and the machines on which the openQA workers are running, especially openqaworker-2:18 and openqaworker-2:20. They may need maintenance.

Workaround

Even rerunning the jobs does not improve the current situation.


Related issues

Related to openQA Infrastructure - action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M
Status: Resolved. Start date: 2022-03-24. Due date: 2022-04-15.

History

#1 Updated by waynechen55 3 months ago

This is related to https://progress.opensuse.org/issues/108764

From what I have seen so far, all failures are related to:

  • boot_from_pxe
  • scc_registration
  • Jobs ending as "incomplete" (the worker repeatedly fails to establish an IPMI connection to the SUT)

#2 Updated by okurz 3 months ago

  • Related to action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M added

#3 Updated by okurz 3 months ago

  • Target version set to Ready

Please add more details from the job details to the ticket description. I linked a related ticket. This is likely due to a network problem in the QA network. I recommend that you use https://github.com/os-autoinst/scripts#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger to detect such issues and retry if you see that this helps.
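
As a rough illustration of what such a known-issue label checks, the auto_review pattern from the title of the related ticket #108845 can be tried against log lines. This is only a minimal sketch, assuming the pattern is searched as a regular expression in a job's failure text; the function name and the sample log lines are hypothetical and not taken from any real job or from the auto-review script itself.

```python
import re

# Pattern copied verbatim from the auto_review label in the subject of #108845.
AUTO_REVIEW_PATTERN = re.compile(
    r"(Error connecting to VNC server.*qa.suse.*Connection timed out"
    r"|ipmitool.*qa.suse.*Unable to establish)"
)


def matches_known_issue(log_text: str) -> bool:
    """Return True if the text matches the known-issue pattern (hypothetical helper)."""
    return AUTO_REVIEW_PATTERN.search(log_text) is not None


# Hypothetical log lines, made up for illustration only.
samples = [
    "Error connecting to VNC server <sut.qa.suse.de:5901>: Connection timed out",
    "ipmitool -H sut-sp.qa.suse.de chassis power status: Unable to establish IPMI session",
    "Test died: no candidate needle matched",
]

for line in samples:
    label = "known issue, candidate for retry" if matches_known_issue(line) else "needs review"
    print(f"{label}: {line}")
```

Only the first two sample lines match, which corresponds to the VNC and IPMI connection failures described in this ticket; anything else would still need manual review.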

#4 Updated by okurz 3 months ago

  • Priority changed from High to Urgent

#5 Updated by okurz 3 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

This might as well be just related to #108845 and QA (related) switches. Due to recent development e.g. in #108845#note-21 I will take this ticket and track it.

#6 Updated by okurz 3 months ago

  • Due date set to 2022-04-01

#7 Updated by okurz 3 months ago

  • Due date changed from 2022-04-01 to 2022-03-31

#8 Updated by okurz 3 months ago

  • Due date deleted (2022-03-31)

#9 Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved

After #108845 got resolved and checking results I find that https://openqa.suse.de/group_overview/263 looks much better now. Also the history of jobs on the mentioned workers https://openqa.suse.de/admin/workers/1250 and https://openqa.suse.de/admin/workers/376 look mostly green so I assume that the underlying problem was actually the same, within the EngInfra maintained network infrastructure.

#10 Updated by waynechen55 3 months ago

okurz wrote:

After #108845 got resolved and checking results I find that https://openqa.suse.de/group_overview/263 looks much better now. Also the history of jobs on the mentioned workers https://openqa.suse.de/admin/workers/1250 and https://openqa.suse.de/admin/workers/376 look mostly green so I assume that the underlying problem was actually the same, within the EngInfra maintained network infrastructure.

Thanks for your great help.
