action #106685

coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

coordination #109656: [epic] Stable non-qemu backends

Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry

Added by okurz 4 months ago. Updated 23 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Concrete Bugs
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://openqa.suse.de/tests/8151113
is incomplete with

Reason: backend died: Error connecting to VNC server <10.161.145.85:5901>: IO::Socket::INET: connect: Connection timed out

maybe related to #76813
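
A minimal sketch of how the auto_review pattern from the ticket subject relates to the reason above. The pattern itself is copied from the subject; applying it as an unanchored regex search against the job reason is an assumption about how auto-review labels jobs, and `matches_ticket` is a hypothetical helper name.

```python
import re

# Pattern copied verbatim from the auto_review part of the ticket subject.
AUTO_REVIEW = re.compile(
    r"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out"
)


def matches_ticket(reason: str) -> bool:
    """Return True if the reason text contains the ticket's pattern.

    Assumption: the pattern is applied as an unanchored search.
    """
    return AUTO_REVIEW.search(reason) is not None


# Reason of https://openqa.suse.de/tests/8151113 as quoted above.
timed_out = (
    "backend died: Error connecting to VNC server <10.161.145.85:5901>: "
    "IO::Socket::INET: connect: Connection timed out"
)
# Same text with the error wording used in the subject of #76813.
refused = timed_out.replace("timed out", "refused")

print(matches_ticket(timed_out))  # True
print(matches_ticket(refused))    # False
```

The second call shows why this ticket and #76813 are tracked separately: the patterns differ only in the final error text.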

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
by calling openqa-query-for-job-label poo#106685

Suggestions

  • SQL queries help track down affected jobs, but the logs of each job still need to be inspected manually
  • The issue is impossible to reproduce locally
  • We apparently cannot debug a live job; if we could, we could detect e.g. VNC issues as they happen
    • Maybe we can add a feature to record VNC issues?
    • Can we rule out all other cases?
    • Killing a VNC server on purpose would produce a similar end result, but that is not the actual cause
    • Can we re-use VNC SSH connections?
      • Cris doesn't understand this properly
      • We would rather not add things to production without knowing whether they will work
      • Let's brainstorm this in a mob session
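
As input for the brainstorming, here is a rough sketch of what recording VNC issues could look like from the worker side: a single TCP connection attempt classified as open, refused, or timed out. This is not code from os-autoinst (which is written in Perl); `probe_tcp` is a hypothetical helper, and the 5-second default timeout is an arbitrary assumption.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt to host:port.

    Hypothetical helper for illustration, not part of os-autoinst.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout, the symptom tracked in this ticket.
        return "timed out"
    except ConnectionRefusedError:
        # The host answered but nothing listens on the port (compare #76813).
        return "refused"
    except OSError as err:
        # DNS failures, unreachable networks and similar errors.
        return f"error: {err}"
```

Calling something like `probe_tcp("s390qa102.qa.suse.de", 5901)` from an affected worker and logging the result with a timestamp would at least show whether the port was unreachable at that moment. Whether such a probe belongs in production is exactly the open question raised above.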

Related issues

Related to openQA Tests - action #108953: [tools] Performance issues in some s390 workers (Resolved, 2022-03-25)

Copied from openQA Project - action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused":retry (New, 2020-10-30)

Copied to openQA Project - action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:M (Resolved)

History

#1 Updated by okurz 4 months ago

  • Copied from action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused":retry added

#2 Updated by okurz 4 months ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to mkittler

mkittler this could be related to #76813, so I am assigning it to you for awareness and tracking it as "blocked" until #76813 is resolved.

#3 Updated by okurz 4 months ago

  • Subject changed from [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out" to [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry

#4 Updated by mkittler 4 months ago

  • Status changed from Blocked to Rejected

After having a closer look I come to the conclusion that this problem is actually identical to the cases mentioned in #76813.

#5 Updated by okurz 4 months ago

I hope you checked whether auto-review has labeled more jobs with this ticket.

#6 Updated by openqa_review 4 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_reinstall
https://openqa.suse.de/tests/8306625

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#7 Updated by okurz 3 months ago

  • Status changed from Rejected to Feedback
  • Priority changed from Normal to High

mkittler see? I told you in #106685#note-5 :)

#8 Updated by okurz 3 months ago

  • Related to action #108953: [tools] Performance issues in some s390 workers added

#9 Updated by okurz 3 months ago

$ openqa-query-for-job-label poo#106685
8444946|2022-03-31 08:37:24|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.98:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8444920|2022-03-31 08:08:20|done|failed|qam-gnome||grenache-1
8439745|2022-03-30 22:58:02|done|failed|offline_sles15sp3_pscc_basesys-srv-lgm-pcm_def_full||grenache-1
8435977|2022-03-30 10:42:42|done|failed|qam-gnome:investigate:last_good_tests_and_build:33a80d6163959f20deaf10af84ebe3d65c87d31a+20220329-1||grenache-1
8434344|2022-03-30 02:43:55|done|failed|qam-gnome||grenache-1
8433768|2022-03-30 02:00:52|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8429711|2022-03-29 09:20:54|done|failed|qam-gnome||grenache-1
8421302|2022-03-29 05:43:56|done|incomplete|slem_installation_autoyast:investigate:last_good_tests:9876969163d82d1f820f023f4012e1f3a6317d73|backend died: Error connecting to VNC server <10.161.145.97:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8428299|2022-03-29 04:32:35|done|incomplete|mru-install-minimal-with-addons|backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8427268|2022-03-29 02:02:10|done|incomplete|mru-install-minimal-with-addons|backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
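
The listing above is pipe-separated, so it can be post-processed with a few lines of Python. This is only a sketch: the field names are my own labels (the script prints no header), and it assumes that no field contains a `|` character. The sample is three lines copied from the output above.

```python
from collections import Counter

# Hypothetical labels for the seven columns; the script output has no header.
FIELDS = ("job_id", "finished", "state", "result", "test", "reason", "worker")


def parse_line(line: str) -> dict:
    """Split one pipe-separated output line into named fields."""
    return dict(zip(FIELDS, line.rstrip("\n").split("|")))


sample = """\
8444946|2022-03-31 08:37:24|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.98:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8444920|2022-03-31 08:08:20|done|failed|qam-gnome||grenache-1
8433768|2022-03-30 02:00:52|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
"""

jobs = [parse_line(line) for line in sample.splitlines()]

# Count jobs per result and collect the ones whose reason shows the timeout.
by_result = Counter(job["result"] for job in jobs)
vnc_timeouts = [job["job_id"] for job in jobs if "Connection timed out" in job["reason"]]

print(dict(by_result))  # {'incomplete': 2, 'failed': 1}
print(vnc_timeouts)     # ['8444946', '8433768']
```

In the listing, the failed jobs carry an empty reason, so only the incomplete ones can be matched by the reason text.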

#10 Updated by mkittler 3 months ago

  • Status changed from Feedback to New
  • Assignee deleted (mkittler)

Direct links to some of the jobs:

Looks like it isn't only happening on grenache-1:

openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_by_host, count(*) total from finished where t_finished >= '2022-03-01' group by host order by ratio_by_host desc;
        host         | ratio_by_host | total 
---------------------+---------------+-------
 openqaworker2       |          4.16 | 10902
 grenache-1          |          1.54 | 16384
 automotive:1        |             0 |  1212
 malbec              |             0 |  2865
 openqa-piworker     |             0 |    11
 openqaworker-arm-1  |             0 |  3978
 openqaworker-arm-2  |             0 |  7675
 openqaworker-arm-3  |             0 |  7456
 openqaworker10      |             0 |  7866
 openqaworker13      |             0 | 12213
 openqaworker15      |             0 |     8
 openqaworker3       |             0 | 12724
 openqaworker5       |             0 | 21242
 openqaworker6       |             0 | 19230
 openqaworker8       |             0 | 13057
 openqaworker9       |             0 | 11992
 powerqaworker-qam-1 |             0 |  3212
 QA-Power8-4-kvm     |             0 |  4496
                     |             0 | 29477
 QA-Power8-5-kvm     |             0 |  4999

Some more details:

openqa=> select id, t_finished, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%' and t_finished >= '2022-03-01' order by t_finished limit 50;
   id    |     t_finished      |   result   |                                                          reason                                                           |    worker     
---------+---------------------+------------+---------------------------------------------------------------------------------------------------------------------------+---------------
 8247241 | 2022-03-01 02:03:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.99:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8250826 | 2022-03-01 19:42:08 | incomplete | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8250825 | 2022-03-01 19:42:26 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8251190 | 2022-03-01 20:07:17 | incomplete | backend died: Error connecting to VNC server <s390qa106.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8256086 | 2022-03-02 16:32:11 | incomplete | backend died: Error connecting to VNC server <s390qa104.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8256499 | 2022-03-02 16:58:10 | incomplete | backend died: Error connecting to VNC server <s390qa104.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8261667 | 2022-03-03 01:58:46 | failed     | backend done: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8261669 | 2022-03-03 02:06:16 | incomplete | backend died: Error connecting to VNC server <10.161.145.85:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8265547 | 2022-03-03 13:43:14 | incomplete | backend died: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8277320 | 2022-03-06 01:50:09 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8280718 | 2022-03-07 02:06:48 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280740 | 2022-03-07 02:27:35 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280748 | 2022-03-07 02:50:09 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280757 | 2022-03-07 03:12:35 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280767 | 2022-03-07 03:35:05 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280774 | 2022-03-07 03:58:22 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280794 | 2022-03-07 04:20:27 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280834 | 2022-03-07 04:41:29 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280841 | 2022-03-07 05:02:34 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280848 | 2022-03-07 05:23:50 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280852 | 2022-03-07 05:44:36 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280864 | 2022-03-07 06:05:51 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2

As with #76813, I'm still not sure why this is happening.

#11 Updated by cdywan 3 months ago

  • Description updated (diff)

#12 Updated by cdywan 3 months ago

  • Assignee set to okurz

okurz volunteered to look into the ticket and ponder some ideas

#13 Updated by okurz 3 months ago

  • Copied to action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:M added

#14 Updated by okurz 3 months ago

  • Status changed from New to Blocked

Some more statistics:

openqa=> select count(*), machine from jobs where (result='failed' or result='incomplete') and reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%' group by machine order by count desc;
 count |         machine         
-------+-------------------------
   354 | s390x-zVM-Upgrade-m1
   297 | s390x-kvm-sle12
    58 | s390x-zVM-vswitch-l2
    55 | s390x-zVM-vswitch-l3
    23 | svirt-hyperv
    21 | svirt-hyperv-uefi
    16 | s390x-kvm-sle15
     9 | ppc64le-hmc-single-disk
     7 | svirt-xen-pv
     5 | svirt-xen-hvm
     1 | svirt-hyperv2012r2-uefi
     1 | svirt-hyperv2016
     1 | svirt-hyperv2016-uefi
     1 | ipmi-64bit-mlx_con5
(14 rows)

So mostly s390x, both z/VM and KVM. We have to start somewhere, and https://app.codecov.io/gh/os-autoinst/os-autoinst/blob/master/backend/svirt.pm shows that we don't have good statement coverage. So let's start with the comparatively easy and safe task of increasing the test coverage of backend/svirt.pm before making any further changes, even if they are just low-risk logging enhancements.
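
The s390x share can be checked with a quick tally of the query result. The counts are copied from the table above; treating every machine whose name starts with `s390x-` as s390x is my own reading of the names.

```python
# Job counts per machine, copied from the query result in this comment.
counts = {
    "s390x-zVM-Upgrade-m1": 354,
    "s390x-kvm-sle12": 297,
    "s390x-zVM-vswitch-l2": 58,
    "s390x-zVM-vswitch-l3": 55,
    "svirt-hyperv": 23,
    "svirt-hyperv-uefi": 21,
    "s390x-kvm-sle15": 16,
    "ppc64le-hmc-single-disk": 9,
    "svirt-xen-pv": 7,
    "svirt-xen-hvm": 5,
    "svirt-hyperv2012r2-uefi": 1,
    "svirt-hyperv2016": 1,
    "svirt-hyperv2016-uefi": 1,
    "ipmi-64bit-mlx_con5": 1,
}

total = sum(counts.values())
# Assumption: the machine name prefix identifies the architecture.
s390x = sum(n for machine, n in counts.items() if machine.startswith("s390x-"))

print(total, s390x, f"{100 * s390x / total:.1f}%")  # 849 780 91.9%
```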

-> Blocking by #109620

#15 Updated by okurz 3 months ago

  • Parent task set to #109656

#16 Updated by openqa_review 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp3_pscc_base_all_minimal
https://openqa.suse.de/tests/8570852#step/reconnect_mgmt_console/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#17 Updated by okurz about 1 month ago

  • Priority changed from High to Normal

#18 Updated by okurz about 1 month ago

  • Subject changed from [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry to Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry
  • Status changed from Blocked to New
  • Assignee deleted (okurz)

100% statement coverage reached within #109620. We can now look into this ticket again with proper unit test coverage as a safety net.

#19 Updated by okurz 23 days ago

  • Target version changed from Ready to future

So we have improved unit test coverage for the relevant backend file, which can help any contributor. Right now I don't see what we can do immediately ourselves, so I am removing this again from the backlog of SUSE QE Tools. Any contributor is free to pick it up.
