action #106685
opencoordination #125708: [epic] Future ideas for more stable non-qemu backends
Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry
Description
Observation
https://openqa.suse.de/tests/8151113
is incomplete with
Reason: backend died: Error connecting to VNC server <10.161.145.85:5901>: IO::Socket::INET: connect: Connection timed out
maybe related to #76813
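To illustrate how the auto_review expression in the subject separates this ticket from #76813, here is a minimal sketch. It assumes the expression is applied as an ordinary regular-expression search against the job's reason; the "refused" string is constructed for illustration and not taken from a real job.

```python
import re

# Pattern copied from the ticket subject (the quoted part after "auto_review:").
AUTO_REVIEW = re.compile(
    r"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out"
)


def matches_ticket(reason: str) -> bool:
    """Return True if a job reason matches this ticket's auto_review pattern."""
    return AUTO_REVIEW.search(reason) is not None


# Reason of https://openqa.suse.de/tests/8151113 as quoted above.
timed_out = (
    "backend died: Error connecting to VNC server <10.161.145.85:5901>: "
    "IO::Socket::INET: connect: Connection timed out"
)
# Constructed variant resembling the "Connection refused" case of #76813.
refused = timed_out.replace("Connection timed out", "Connection refused")

print(matches_ticket(timed_out))
print(matches_ticket(refused))
```

Only the "timed out" variant matches, so jobs failing with "Connection refused" would not be labeled with this ticket.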
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
e.g. by calling: openqa-query-for-job-label poo#106685
Suggestions
- SQL queries help track down affected jobs, but the logs still have to be dug into
- Impossible to reproduce locally
- We apparently can't debug live, but if we could, we could detect e.g. VNC issues
- Maybe we can add a feature to record VNC issues?
- Can we rule out all other cases?
- Killing a VNC server on purpose would achieve a similar end result, but that's not the actual cause
- Can we re-use vnc ssh connections?
- Cris doesn't understand this properly
- We'd rather not add stuff to production if we don't know whether it's going to work
- Let's brainstorm this in a mob session
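As one possible shape for the "record VNC issues" idea, here is a minimal sketch of a standalone probe that tells "connection refused" (#76813) apart from "connection timed out" (this ticket). It is not os-autoinst code; the function name probe_tcp and its return values are made up for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connect and classify the outcome.

    Returns "ok", "refused", "timed out" or "error: <details>".
    """
    try:
        # create_connection resolves the host and applies the timeout to connect().
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # The host answered but nothing listens on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout, e.g. host down or packets dropped.
        return "timed out"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable network.
        return f"error: {exc}"
```

Run from a worker host, a call such as probe_tcp("10.161.145.85", 5901) just before the backend connects could be logged to show whether the target was unreachable or merely not listening yet.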
Updated by okurz almost 3 years ago
- Copied from action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Updated by okurz almost 3 years ago
- Subject changed from [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out" to [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry
Updated by mkittler almost 3 years ago
- Status changed from Blocked to Rejected
After having a closer look I have come to the conclusion that this problem is actually identical to the cases mentioned in #76813.
Updated by okurz almost 3 years ago
I hope you checked if there are more labeled jobs due to auto-review
Updated by openqa_review almost 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: autoyast_reinstall
https://openqa.suse.de/tests/8306625
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by okurz over 2 years ago
- Status changed from Rejected to Feedback
- Priority changed from Normal to High
@mkittler see? I told you in #106685#note-5 :)
Updated by okurz over 2 years ago
- Related to action #108953: [tools] Performance issues in some s390 workers added
Updated by okurz over 2 years ago
$ openqa-query-for-job-label poo#106685
8444946|2022-03-31 08:37:24|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.98:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8444920|2022-03-31 08:08:20|done|failed|qam-gnome||grenache-1
8439745|2022-03-30 22:58:02|done|failed|offline_sles15sp3_pscc_basesys-srv-lgm-pcm_def_full||grenache-1
8435977|2022-03-30 10:42:42|done|failed|qam-gnome:investigate:last_good_tests_and_build:33a80d6163959f20deaf10af84ebe3d65c87d31a+20220329-1||grenache-1
8434344|2022-03-30 02:43:55|done|failed|qam-gnome||grenache-1
8433768|2022-03-30 02:00:52|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8429711|2022-03-29 09:20:54|done|failed|qam-gnome||grenache-1
8421302|2022-03-29 05:43:56|done|incomplete|slem_installation_autoyast:investigate:last_good_tests:9876969163d82d1f820f023f4012e1f3a6317d73|backend died: Error connecting to VNC server <10.161.145.97:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8428299|2022-03-29 04:32:35|done|incomplete|mru-install-minimal-with-addons|backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8427268|2022-03-29 02:02:10|done|incomplete|mru-install-minimal-with-addons|backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
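The output above can be post-processed, for example to see which labeled jobs actually carry the VNC reason. This is a minimal sketch; the field names are inferred from the output and are an assumption, not something the script documents here. The sample consists of three lines copied from the output above.

```python
import re
from collections import Counter

# Assumed meaning of the seven pipe-separated columns (inferred from the output).
FIELDS = ("id", "finished", "state", "result", "test", "reason", "worker")

SAMPLE = """\
8444946|2022-03-31 08:37:24|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.98:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8444920|2022-03-31 08:08:20|done|failed|qam-gnome||grenache-1
8433768|2022-03-30 02:00:52|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
"""


def parse_line(line: str) -> dict:
    """Split one output line into a dict keyed by FIELDS."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    return dict(zip(FIELDS, parts))


rows = [parse_line(line) for line in SAMPLE.splitlines() if line]

# Count jobs per result and collect the ones without any recorded reason.
by_result = Counter(row["result"] for row in rows)
without_reason = [row["id"] for row in rows if not row["reason"]]

# Extract the VNC target host from reasons that contain "<host:port>".
vnc_targets = [
    m.group(1) for row in rows if (m := re.search(r"<([^>:]+):\d+>", row["reason"]))
]

print(by_result)
print(without_reason)
print(vnc_targets)
```

In this sample the failed job has an empty reason, so for such jobs the logs still have to be checked by hand.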
Updated by mkittler over 2 years ago
- Status changed from Feedback to New
- Assignee deleted (mkittler)
Direct links to some of the jobs:
- https://openqa.suse.de/tests/8444946#step/await_install/1
- https://openqa.suse.de/tests/8427268#step/await_install/1
- https://openqa.suse.de/tests/8444920
Looks like it isn't only happening on grenache-1:
openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_by_host, count(*) total from finished where t_finished >= '2022-03-01' group by host order by ratio_by_host desc;
host | ratio_by_host | total
---------------------+---------------+-------
openqaworker2 | 4.16 | 10902
grenache-1 | 1.54 | 16384
automotive:1 | 0 | 1212
malbec | 0 | 2865
openqa-piworker | 0 | 11
openqaworker-arm-1 | 0 | 3978
openqaworker-arm-2 | 0 | 7675
openqaworker-arm-3 | 0 | 7456
openqaworker10 | 0 | 7866
openqaworker13 | 0 | 12213
openqaworker15 | 0 | 8
openqaworker3 | 0 | 12724
openqaworker5 | 0 | 21242
openqaworker6 | 0 | 19230
openqaworker8 | 0 | 13057
openqaworker9 | 0 | 11992
powerqaworker-qam-1 | 0 | 3212
QA-Power8-4-kvm | 0 | 4496
| 0 | 29477
QA-Power8-5-kvm | 0 | 4999
Some more details:
openqa=> select id, t_finished, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%' and t_finished >= '2022-03-01' order by t_finished limit 50;
id | t_finished | result | reason | worker
---------+---------------------+------------+---------------------------------------------------------------------------------------------------------------------------+---------------
8247241 | 2022-03-01 02:03:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.99:5901>: IO::Socket::INET: connect: Connection timed out | grenache-1
8250826 | 2022-03-01 19:42:08 | incomplete | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8250825 | 2022-03-01 19:42:26 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8251190 | 2022-03-01 20:07:17 | incomplete | backend died: Error connecting to VNC server <s390qa106.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8256086 | 2022-03-02 16:32:11 | incomplete | backend died: Error connecting to VNC server <s390qa104.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8256499 | 2022-03-02 16:58:10 | incomplete | backend died: Error connecting to VNC server <s390qa104.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8261667 | 2022-03-03 01:58:46 | failed | backend done: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out | grenache-1
8261669 | 2022-03-03 02:06:16 | incomplete | backend died: Error connecting to VNC server <10.161.145.85:5901>: IO::Socket::INET: connect: Connection timed out | grenache-1
8265547 | 2022-03-03 13:43:14 | incomplete | backend died: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8277320 | 2022-03-06 01:50:09 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out | grenache-1
8280718 | 2022-03-07 02:06:48 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280740 | 2022-03-07 02:27:35 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280748 | 2022-03-07 02:50:09 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280757 | 2022-03-07 03:12:35 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280767 | 2022-03-07 03:35:05 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280774 | 2022-03-07 03:58:22 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280794 | 2022-03-07 04:20:27 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280834 | 2022-03-07 04:41:29 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280841 | 2022-03-07 05:02:34 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280848 | 2022-03-07 05:23:50 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280852 | 2022-03-07 05:44:36 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
8280864 | 2022-03-07 06:05:51 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
I'm still not sure why that is happening, similar to #76813.
Updated by livdywan over 2 years ago
- Assignee set to okurz
@okurz volunteered to look into the ticket and ponder some ideas
Updated by okurz over 2 years ago
- Copied to action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:M added
Updated by okurz over 2 years ago
- Status changed from New to Blocked
Some more statistics:
openqa=> select count(*), machine from jobs where (result='failed' or result='incomplete') and reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%' group by machine order by count desc;
count | machine
-------+-------------------------
354 | s390x-zVM-Upgrade-m1
297 | s390x-kvm-sle12
58 | s390x-zVM-vswitch-l2
55 | s390x-zVM-vswitch-l3
23 | svirt-hyperv
21 | svirt-hyperv-uefi
16 | s390x-kvm-sle15
9 | ppc64le-hmc-single-disk
7 | svirt-xen-pv
5 | svirt-xen-hvm
1 | svirt-hyperv2012r2-uefi
1 | svirt-hyperv2016
1 | svirt-hyperv2016-uefi
1 | ipmi-64bit-mlx_con5
(14 rows)
So mostly s390x, both z/VM and KVM. We have to start somewhere, and https://app.codecov.io/gh/os-autoinst/os-autoinst/blob/master/backend/svirt.pm shows that we don't have good statement coverage. So let's start with the comparatively easy and safe task of increasing test coverage of backend/svirt.pm before making any further changes, even if they are just low-risk logging enhancements.
-> Blocked by #109620
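The "mostly s390x" reading of the table can be checked with a bit of arithmetic. This is a minimal sketch using the counts from the query result above; the grouping into families by machine name prefix is my own assumption.

```python
from collections import Counter

# Counts per machine, copied from the query result above.
COUNTS = {
    "s390x-zVM-Upgrade-m1": 354,
    "s390x-kvm-sle12": 297,
    "s390x-zVM-vswitch-l2": 58,
    "s390x-zVM-vswitch-l3": 55,
    "svirt-hyperv": 23,
    "svirt-hyperv-uefi": 21,
    "s390x-kvm-sle15": 16,
    "ppc64le-hmc-single-disk": 9,
    "svirt-xen-pv": 7,
    "svirt-xen-hvm": 5,
    "svirt-hyperv2012r2-uefi": 1,
    "svirt-hyperv2016": 1,
    "svirt-hyperv2016-uefi": 1,
    "ipmi-64bit-mlx_con5": 1,
}


def family(machine: str) -> str:
    """Map a machine name to a coarse family based on its name prefix."""
    for prefix, name in (
        ("s390x-zVM", "s390x z/VM"),
        ("s390x-kvm", "s390x KVM"),
        ("svirt-hyperv", "Hyper-V"),
        ("svirt-xen", "Xen"),
    ):
        if machine.startswith(prefix):
            return name
    return "other"


totals = Counter()
for machine, count in COUNTS.items():
    totals[family(machine)] += count

total = sum(COUNTS.values())
s390x = totals["s390x z/VM"] + totals["s390x KVM"]
s390x_share = 100 * s390x / total

for name, count in totals.most_common():
    print(f"{name:12} {count:4} {100 * count / total:5.1f}%")
print(f"s390x total  {s390x:4} {s390x_share:5.1f}%")
```

With these numbers s390x accounts for 780 of 849 affected jobs, roughly 92%, split into 467 on z/VM and 313 on KVM.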
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: offline_sles15sp3_pscc_base_all_minimal
https://openqa.suse.de/tests/8570852#step/reconnect_mgmt_console/1
Updated by okurz over 2 years ago
- Subject changed from [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry to Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry
- Status changed from Blocked to New
- Assignee deleted (okurz)
100% statement coverage reached within #109620. We can now look into this ticket again with proper unit test coverage as a safety net.
Updated by okurz over 2 years ago
- Target version changed from Ready to future
So we have improved unit test coverage for the relevant backend file, which can help any contributor. Right now I don't see what we can do immediately ourselves, so I'm removing this again from the backlog of SUSE QE Tools. Any contributor is free to pick it up.
Updated by okurz almost 2 years ago
- Parent task changed from #109656 to #125708