action #106685

open

coordination #125708: [epic] Future ideas for more stable non-qemu backends

Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry

Added by okurz over 2 years ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/tests/8151113
is incomplete with

Reason: backend died: Error connecting to VNC server <10.161.145.85:5901>: IO::Socket::INET: connect: Connection timed out

maybe related to #76813
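
The auto_review part of the ticket subject is a regular expression that is compared against a job's reason. The following is a minimal Python sketch of that comparison, assuming plain regex search semantics; the real matching is done by the os-autoinst/scripts tooling and may differ in detail. The helper name `matches_ticket` is hypothetical. It also shows why the "Connection refused" variant tracked in #76813 is not picked up by this ticket's expression.

```python
import re

# Regex copied verbatim from the auto_review:"..." part of the ticket subject.
AUTO_REVIEW = re.compile(
    r"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out"
)


def matches_ticket(reason: str) -> bool:
    """Return True if a job reason matches this ticket's auto_review regex.

    Hypothetical helper for illustration; not part of openQA or os-autoinst.
    """
    return AUTO_REVIEW.search(reason) is not None


# Reason string taken from the observation above.
timed_out = (
    "backend died: Error connecting to VNC server <10.161.145.85:5901>: "
    "IO::Socket::INET: connect: Connection timed out"
)
# Same reason with the error text of #76813 instead.
refused = timed_out.replace("timed out", "refused")

print(matches_ticket(timed_out))  # True
print(matches_ticket(refused))    # False
```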

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#106685

Suggestions

  • SQL queries help track down affected jobs, but the logs still need to be examined by hand
  • impossible to reproduce locally
  • we apparently can't debug live, but if we hypothetically could, we could detect e.g. VNC issues
    • maybe we can add a feature to record VNC issues?
    • can we rule out all other cases?
    • killing a VNC server on purpose would give a similar end result, but that is not the actual cause
    • can we re-use VNC SSH connections?
      • Cris doesn't understand this properly
      • we would rather not add anything to production if we don't know whether it is going to work
      • let's brainstorm this in a mob session
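
The idea of recording VNC issues could look roughly like the sketch below. It is written in Python purely for illustration (os-autoinst itself is Perl), and `probe_vnc` and all its parameters are hypothetical. It only checks whether a TCP connection to the VNC port can be opened and how long each attempt takes; it does not speak the VNC protocol.

```python
import socket
import time


def probe_vnc(host, port, timeout=5.0, attempts=3, delay=1.0):
    """Try to open a TCP connection and record the outcome of every attempt.

    Returns a list of (attempt, outcome, elapsed_seconds) tuples.
    Hypothetical helper; not an existing os-autoinst function.
    """
    results = []
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            # create_connection raises OSError subclasses for timeouts,
            # refused connections and name resolution failures.
            with socket.create_connection((host, port), timeout=timeout):
                outcome = "connected"
        except OSError as exc:
            outcome = f"{type(exc).__name__}: {exc}"
        elapsed = time.monotonic() - start
        results.append((attempt, outcome, elapsed))
        if outcome == "connected":
            break
        time.sleep(delay)
    return results


def format_results(host, port, results):
    """Render the recorded attempts as log lines."""
    return [
        f"VNC probe {host}:{port} attempt {n}: {outcome} after {elapsed:.2f}s"
        for n, outcome, elapsed in results
    ]
```

A call such as `probe_vnc("s390qa102.qa.suse.de", 5901)` (host and port taken from the job reasons in this ticket) would then show whether a later attempt succeeds and how long each attempt waited, which is the kind of information that is currently missing from the logs.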

Related issues 3 (1 open, 2 closed)

Related to openQA Tests - action #108953: [tools] Performance issues in some s390 workers (Resolved, okurz, 2022-03-25)
Copied from openQA Project - action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" (New, 2020-10-30)
Copied to openQA Project - action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:M (Resolved, osukup)

Actions #1

Updated by okurz over 2 years ago

  • Copied from action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Actions #2

Updated by okurz over 2 years ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to mkittler

@mkittler this could be related to #76813, so I am assigning it to you so that you know about it. Tracking it as "Blocked" until #76813 is resolved.

Actions #3

Updated by okurz over 2 years ago

  • Subject changed from [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out" to [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry
Actions #4

Updated by mkittler over 2 years ago

  • Status changed from Blocked to Rejected

After having a closer look, I have come to the conclusion that this problem is actually identical to the cases mentioned in #76813.

Actions #5

Updated by okurz over 2 years ago

I hope you checked if there are more labeled jobs due to auto-review

Actions #6

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_reinstall
https://openqa.suse.de/tests/8306625

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #7

Updated by okurz about 2 years ago

  • Status changed from Rejected to Feedback
  • Priority changed from Normal to High

@mkittler see? I told you in #106685#note-5 :)

Actions #8

Updated by okurz about 2 years ago

  • Related to action #108953: [tools] Performance issues in some s390 workers added
Actions #9

Updated by okurz about 2 years ago

$ openqa-query-for-job-label poo#106685
8444946|2022-03-31 08:37:24|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.98:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8444920|2022-03-31 08:08:20|done|failed|qam-gnome||grenache-1
8439745|2022-03-30 22:58:02|done|failed|offline_sles15sp3_pscc_basesys-srv-lgm-pcm_def_full||grenache-1
8435977|2022-03-30 10:42:42|done|failed|qam-gnome:investigate:last_good_tests_and_build:33a80d6163959f20deaf10af84ebe3d65c87d31a+20220329-1||grenache-1
8434344|2022-03-30 02:43:55|done|failed|qam-gnome||grenache-1
8433768|2022-03-30 02:00:52|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8429711|2022-03-29 09:20:54|done|failed|qam-gnome||grenache-1
8421302|2022-03-29 05:43:56|done|incomplete|slem_installation_autoyast:investigate:last_good_tests:9876969163d82d1f820f023f4012e1f3a6317d73|backend died: Error connecting to VNC server <10.161.145.97:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8428299|2022-03-29 04:32:35|done|incomplete|mru-install-minimal-with-addons|backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8427268|2022-03-29 02:02:10|done|incomplete|mru-install-minimal-with-addons|backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
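
The pipe-separated output above can be summarised with a few lines of Python. This is only a sketch: the column names (`id`, `finished`, `state`, `result`, `test`, `reason`, `worker`) are assumptions inferred from the output, not taken from the script's documentation, and `parse_line` is a hypothetical helper. Only three of the lines above are repeated here to keep the example short.

```python
from collections import Counter

# Assumed column names, inferred from the output shown above.
FIELDS = ("id", "finished", "state", "result", "test", "reason", "worker")

# Three lines copied verbatim from the output above.
SAMPLE = """\
8444946|2022-03-31 08:37:24|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.98:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
8444920|2022-03-31 08:08:20|done|failed|qam-gnome||grenache-1
8433768|2022-03-30 02:00:52|done|incomplete|qam-minimal+base|backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out|grenache-1
"""


def parse_line(line):
    """Split one output line into a dict keyed by the assumed column names."""
    parts = line.split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected number of fields: {line!r}")
    return dict(zip(FIELDS, parts))


jobs = [parse_line(line) for line in SAMPLE.splitlines() if line]
by_result = Counter(job["result"] for job in jobs)
# Jobs labeled with the ticket but without the VNC error in their reason.
without_reason = [job["id"] for job in jobs if not job["reason"]]

print(dict(by_result))   # {'incomplete': 2, 'failed': 1}
print(without_reason)    # ['8444920']
```

Separating the jobs with an empty reason makes it easier to see which labeled jobs actually hit the VNC timeout and which merely carry the ticket reference.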
Actions #10

Updated by mkittler about 2 years ago

  • Status changed from Feedback to New
  • Assignee deleted (mkittler)

Direct links to some of the jobs:

Looks like it isn't only happening on grenache-1:

openqa=> with finished as (
           select result, reason, t_finished, host
           from jobs left join workers on jobs.assigned_worker_id = workers.id
           where result != 'none')
         select host,
                round(count(*) filter (where (result='failed' or result='incomplete')
                                       and reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%')
                      * 100. / count(*), 2)::numeric(5,2)::float as ratio_by_host,
                count(*) total
         from finished
         where t_finished >= '2022-03-01'
         group by host
         order by ratio_by_host desc;
        host         | ratio_by_host | total 
---------------------+---------------+-------
 openqaworker2       |          4.16 | 10902
 grenache-1          |          1.54 | 16384
 automotive:1        |             0 |  1212
 malbec              |             0 |  2865
 openqa-piworker     |             0 |    11
 openqaworker-arm-1  |             0 |  3978
 openqaworker-arm-2  |             0 |  7675
 openqaworker-arm-3  |             0 |  7456
 openqaworker10      |             0 |  7866
 openqaworker13      |             0 | 12213
 openqaworker15      |             0 |     8
 openqaworker3       |             0 | 12724
 openqaworker5       |             0 | 21242
 openqaworker6       |             0 | 19230
 openqaworker8       |             0 | 13057
 openqaworker9       |             0 | 11992
 powerqaworker-qam-1 |             0 |  3212
 QA-Power8-4-kvm     |             0 |  4496
                     |             0 | 29477
 QA-Power8-5-kvm     |             0 |  4999

Some more details:

openqa=> select id, t_finished, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%' and t_finished >= '2022-03-01' order by t_finished limit 50;
   id    |     t_finished      |   result   |                                                          reason                                                           |    worker     
---------+---------------------+------------+---------------------------------------------------------------------------------------------------------------------------+---------------
 8247241 | 2022-03-01 02:03:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.99:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8250826 | 2022-03-01 19:42:08 | incomplete | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8250825 | 2022-03-01 19:42:26 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8251190 | 2022-03-01 20:07:17 | incomplete | backend died: Error connecting to VNC server <s390qa106.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8256086 | 2022-03-02 16:32:11 | incomplete | backend died: Error connecting to VNC server <s390qa104.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8256499 | 2022-03-02 16:58:10 | incomplete | backend died: Error connecting to VNC server <s390qa104.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8261667 | 2022-03-03 01:58:46 | failed     | backend done: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8261669 | 2022-03-03 02:06:16 | incomplete | backend died: Error connecting to VNC server <10.161.145.85:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8265547 | 2022-03-03 13:43:14 | incomplete | backend died: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8277320 | 2022-03-06 01:50:09 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out        | grenache-1
 8280718 | 2022-03-07 02:06:48 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280740 | 2022-03-07 02:27:35 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280748 | 2022-03-07 02:50:09 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280757 | 2022-03-07 03:12:35 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280767 | 2022-03-07 03:35:05 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280774 | 2022-03-07 03:58:22 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280794 | 2022-03-07 04:20:27 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280834 | 2022-03-07 04:41:29 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280841 | 2022-03-07 05:02:34 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280848 | 2022-03-07 05:23:50 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280852 | 2022-03-07 05:44:36 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2
 8280864 | 2022-03-07 06:05:51 | incomplete | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out | openqaworker2

As with #76813, I'm still not sure why this is happening.

Actions #11

Updated by livdywan about 2 years ago

  • Description updated (diff)
Actions #12

Updated by livdywan about 2 years ago

  • Assignee set to okurz

@okurz volunteered to look into the ticket and ponder some ideas

Actions #13

Updated by okurz about 2 years ago

  • Copied to action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:M added
Actions #14

Updated by okurz about 2 years ago

  • Status changed from New to Blocked

Some more statistics:

openqa=> select count(*), machine from jobs where (result='failed' or result='incomplete') and reason like '%Error connecting to VNC server%IO::Socket::INET: connect: Connection timed out%' group by machine order by count desc;
 count |         machine         
-------+-------------------------
   354 | s390x-zVM-Upgrade-m1
   297 | s390x-kvm-sle12
    58 | s390x-zVM-vswitch-l2
    55 | s390x-zVM-vswitch-l3
    23 | svirt-hyperv
    21 | svirt-hyperv-uefi
    16 | s390x-kvm-sle15
     9 | ppc64le-hmc-single-disk
     7 | svirt-xen-pv
     5 | svirt-xen-hvm
     1 | svirt-hyperv2012r2-uefi
     1 | svirt-hyperv2016
     1 | svirt-hyperv2016-uefi
     1 | ipmi-64bit-mlx_con5
(14 rows)
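
As a quick check of how the counts above split by architecture, here is a small Python calculation. It assumes that every machine whose name starts with `s390x` is an s390x machine; the numbers are copied from the table.

```python
# Counts per machine, copied from the query result above.
counts = {
    "s390x-zVM-Upgrade-m1": 354,
    "s390x-kvm-sle12": 297,
    "s390x-zVM-vswitch-l2": 58,
    "s390x-zVM-vswitch-l3": 55,
    "svirt-hyperv": 23,
    "svirt-hyperv-uefi": 21,
    "s390x-kvm-sle15": 16,
    "ppc64le-hmc-single-disk": 9,
    "svirt-xen-pv": 7,
    "svirt-xen-hvm": 5,
    "svirt-hyperv2012r2-uefi": 1,
    "svirt-hyperv2016": 1,
    "svirt-hyperv2016-uefi": 1,
    "ipmi-64bit-mlx_con5": 1,
}

total = sum(counts.values())
# Assumption: the machine name prefix identifies the architecture.
s390x = sum(n for machine, n in counts.items() if machine.startswith("s390x"))

print(f"{s390x}/{total} = {s390x / total:.1%}")  # 780/849 = 91.9%
```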

So mostly s390x, both z/VM and KVM. We have to start somewhere, and since https://app.codecov.io/gh/os-autoinst/os-autoinst/blob/master/backend/svirt.pm shows that we don't have good statement coverage, let's start with the comparatively easy and safe task of increasing the test coverage of backend/svirt.pm before making any further changes, even if they are just low-risk logging enhancements.

-> Blocked by #109620

Actions #15

Updated by okurz about 2 years ago

  • Parent task set to #109656
Actions #16

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp3_pscc_base_all_minimal
https://openqa.suse.de/tests/8570852#step/reconnect_mgmt_console/1


Actions #17

Updated by okurz about 2 years ago

  • Priority changed from High to Normal
Actions #18

Updated by okurz about 2 years ago

  • Subject changed from [tools] Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry to Test using svirt backend incomplete with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection timed out":retry
  • Status changed from Blocked to New
  • Assignee deleted (okurz)

100% statement coverage reached within #109620. We can now look into this ticket again with proper unit test coverage as a safety net.

Actions #19

Updated by okurz almost 2 years ago

  • Target version changed from Ready to future

So we have improved the unit test coverage for the relevant backend file, which can help any contributor. Right now I don't see what we can do immediately ourselves, so I am removing this from the backlog of SUSE QE Tools again. Any contributor is free to pick it up.

Actions #20

Updated by okurz about 1 year ago

  • Parent task changed from #109656 to #125708