action #109028
closed
[openqa][worker][sut] Very severe stability and connectivity issues of openqa workers and suts
Description
Observation
The openQA environment for virtualization test runs has recently been very unreliable, to the point of being intolerable. Under the current circumstances, a virtualization functional test run cannot finish completely or on time, which is a problem for any testing task with time constraints.
- The last daily build, Build116.4, has not finished its acceptance test run even after many reruns.
- For any unfinished test suite on the Build116.4 acceptance test run page, there are pages of rerun history, for example, this one.
- I believe there is a general environment issue, but from observation openqaworker-2:18 and openqaworker-2:20 are affected the most. They seem unable to finish almost any test run assigned to them, even with reruns.
Steps to reproduce
- Observe a newly triggered test run for a new daily build
- Manually rerun all failed test runs repeatedly
Problem
- It seems that only a general environment issue could have such a widespread and severe impact on openQA test runs
- I am also aware of poo#108845, but I am not sure whether it is relevant.
Suggestion
- Check for environment issues in the openQA network, including SUT machine status, the server room situation, infrastructure connectivity and stability, network glitches, etc.
- Check the status of the openQA workers and of the machines on which they run, especially openqaworker-2:18 and openqaworker-2:20. They may need maintenance.
Workaround
Even reruns do not improve the current situation.
Updated by waynechen55 over 2 years ago
This is related to https://progress.opensuse.org/issues/108764
From what I have seen so far, all failures are related to:
boot_from_pxe
scc_registration
Jobs ending as "incomplete" (the worker repeatedly cannot establish an IPMI connection to the SUT)
Updated by okurz over 2 years ago
- Related to action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M added
Updated by okurz over 2 years ago
- Target version set to Ready
Please add more details from the job details to the ticket description. I linked a related ticket. This is likely due to a network problem in the QA network. I recommend that you use https://github.com/os-autoinst/scripts#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger to detect such issues and retry if you see that this helps.
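For illustration, here is a minimal Python sketch of what the auto_review pattern from the related ticket #108845 would and would not match. The regular expression is copied from that ticket's subject. The function name and the three log excerpts are hypothetical and invented for this example, and the sketch does not reproduce how the auto-review scripts work internally.

```python
import re

# Pattern copied verbatim from the auto_review field in the subject of #108845.
AUTO_REVIEW = re.compile(
    r"(Error connecting to VNC server.*qa.suse.*Connection timed out"
    r"|ipmitool.*qa.suse.*Unable to establish)"
)


def is_known_network_issue(text: str) -> bool:
    """Return True if the given job reason or log line matches the pattern."""
    return AUTO_REVIEW.search(text) is not None


# Hypothetical log excerpts, invented for illustration only.
samples = [
    "ipmitool -I lanplus -H sut.qa.suse.de chassis power status: "
    "Error: Unable to establish IPMI v2 / RMCP+ session",
    "Error connecting to VNC server <sut.qa.suse.de:5901>: "
    "IO::Socket::INET: connect: Connection timed out",
    "Test died: no candidate needle matched in scc_registration",
]

for line in samples:
    label = "known network issue" if is_known_network_issue(line) else "needs review"
    print(f"{label}: {line}")
```

Only the first two excerpts match, so a failure in scc_registration that does not show one of these connection errors would still need manual review.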
Updated by okurz over 2 years ago
- Status changed from New to Blocked
- Assignee set to okurz
This might well be related just to #108845 and the QA (related) switches. Due to recent developments, e.g. in #108845#note-21, I will take this ticket and track it.
Updated by okurz over 2 years ago
- Due date changed from 2022-04-01 to 2022-03-31
Updated by okurz over 2 years ago
- Status changed from Blocked to Resolved
After #108845 got resolved and checking results I find that https://openqa.suse.de/group_overview/263 looks much better now. Also the history of jobs on the mentioned workers https://openqa.suse.de/admin/workers/1250 and https://openqa.suse.de/admin/workers/376 look mostly green so I assume that the underlying problem was actually the same, within the EngInfra maintained network infrastructure.
Updated by waynechen55 over 2 years ago
okurz wrote:
After #108845 got resolved and checking results I find that https://openqa.suse.de/group_overview/263 looks much better now. Also the history of jobs on the mentioned workers https://openqa.suse.de/admin/workers/1250 and https://openqa.suse.de/admin/workers/376 look mostly green so I assume that the underlying problem was actually the same, within the EngInfra maintained network infrastructure.
Thanks for your great help.