action #154357: [sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x - qe-yam - openSUSE Project Management Tool

closed

coordination #151816: [epic] Handle openQA fixes and job group setup

[sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x

Added by rainerkoenig over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: lmanfredi
Target version:
Start date: 2024-01-26
Due date:
% Done:
Estimated time:
Tags: qe-yam-apr-sprint

Description

Motivation

In SLE Micro Maintenance Updates we sporadically experience incomplete slem_installation_autoyast tests.
Example: https://openqa.suse.de/tests/13356734

openQA gives the following reason:

Reason: backend died: Error connecting to VNC server <s390kvm099.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host

Scope:

SLE Micro YaST & Migration Maintenance Updates job group (ID=535)
s390x architecture

Acceptance criteria:

AC1: The root cause is found and the tests become stable again.

Additional info

According to @JERiveraMoya this is not an infrastructure issue:
"it is related with some installation problem or whatever other problem, and then when we try to reconnect to the sut it is not possible because it is not in good shape to reconnect, so unrelated with infra."
Most likely there is a timing issue at some point.

Updated by JERiveraMoya over 1 year ago

Tags set to qe-yam-feb-sprint
Description updated (diff)
Status changed from New to Workable
Parent task set to #151816

Updated by lmanfredi over 1 year ago

Status changed from Workable to In Progress
Assignee set to lmanfredi

Updated by lmanfredi over 1 year ago

The issue no longer seems to occur in the latest builds.

Updated by lmanfredi over 1 year ago

Over 100 runs, the issue is still present, with a failure rate of 14%. See VRs.

Updated by lmanfredi over 1 year ago

With a larger TIMEOUT_SCALE, the issue seems to disappear. See VRs.

Updated by lmanfredi over 1 year ago

Created openqa-job-groups MR#85

Updated by leli over 1 year ago · Edited

I guess this issue may be caused by an SSH access problem. Please try allowing ssh in the firewall and also enabling the sshd service.

Example: autoyast/support_images/sles15sp5_install_textmode_default_patterns_s390x.xml

<services t="list">
  <service>dhcpv6-client</service>
  <service>ssh</service>
  <service>tigervnc</service>
  <service>tigervnc-https</service>
</services>

<services-manager t="map">
  <default_target>multi-user</default_target>
  <services t="map">
    <disable t="list"/>
    <enable t="list">
      <service>firewalld</service>
      <service>wicked</service>
      <service>kdump</service>
      <service>kdump-early</service>
      <service>systemd-remount-fs</service>
      <service>sshd</service>
    </enable>
  </services>
</services-manager>

Updated by JERiveraMoya over 1 year ago

Subject changed from Investigate sporadic failure on AutoYaST SLE Micro installation on s390x to [sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x

Updated by JERiveraMoya over 1 year ago

While we are still figuring out how to fix it, it would be great to set RETRY: 3 to avoid extra work for the reviewer, and to remove it once the solution is found.

#10

Updated by lmanfredi over 1 year ago · Edited

Added RETRY: 3 and increased TIMEOUT_SCALE to 15. Now v5.4 works fine.
See VRs
See MR#85

#11

Updated by JERiveraMoya over 1 year ago

That is not a solution we should accept. Bumping the timeout is never good practice; you can only bump it a little, and only with good reason.
Please see my comment in https://gitlab.suse.de/qe-yam/openqa-job-groups/-/merge_requests/85#note_596902

#12

Updated by JERiveraMoya over 1 year ago

Tags changed from qe-yam-feb-sprint to qe-yam-mar-sprint

#13

Updated by JERiveraMoya about 1 year ago

We still need to narrow down the issue here: we have narrowed down the product where it happens, but not the automation code that handles it.

#14

Updated by JERiveraMoya about 1 year ago

Did you find anything to improve in autoyast/installation.pm (for example, increasing the timeout at some specific point under some specific condition)?
If you point me to the last line executed, perhaps I can take a closer look.

#15

Updated by lmanfredi about 1 year ago · Edited

It seems that the latest builds do not show the sporadic failures:

Here are VRs to check the behavior with TIMEOUT_SCALE.

#16

Updated by JERiveraMoya about 1 year ago

lmanfredi wrote in #note-15:

It seems that the latest builds do not show the sporadic failures:

build=20240307-1
build=20240308-1
build=20240310-1
build=20240311-1
build=20240312-1

Here are VRs to check the behavior with TIMEOUT_SCALE.

For verification you need 10 or more runs. The error was still present 4 days ago in the history of the job linked in the description of this ticket.
The problem is most likely related to this needle not matching:
https://openqa.suse.de/tests/13774275#step/installation/3
You need to find this point in the code and discuss the possibilities with the squad. Unfortunately, our AutoYaST logic relies on that needle to perform some actions, AFAIR.
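
The "10 or more" guideline can be motivated with a little arithmetic. The following is only a rough sketch: it assumes each run fails independently with the 14% rate measured earlier in this ticket over 100 runs, which may not hold if failures cluster by worker or time of day. The function names are illustrative and not part of openQA.

```python
import math


def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass although the bug is present."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of runs for which at least one failure is expected
    with probability >= confidence, if the bug is still present."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    rate = 0.14  # failure rate reported in this ticket (14% over 100 runs)
    for n in (5, 10, 20):
        print(f"{n} runs all green by chance: {prob_all_pass(rate, n):.1%}")
    print("runs for 95% confidence:", runs_needed(rate, 0.95))
```

Under these assumptions, 10 green runs would still happen by chance about 22% of the time, and about 20 runs are needed before a clean result is 95% convincing. A handful of passing builds is therefore weak evidence that the problem is gone.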

#17

Updated by lmanfredi about 1 year ago

Yes, I agree that the problem could be a mismatch for some needles.
At the beginning, the problem seemed to be related to the needle with the tag import-untrusted-gpg-key.
After I excluded that one, the problem now occurs with the current one, autoyast-stage1-reboot-upcoming-pvm:

ERROR - search: out of range 769 768 104 1024
[2024-03-14T18:25:22.850296+01:00] [warn] [pid:67647] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 4.25 seconds for 98 candidate needles - make your needles more specific
[2024-03-14T18:25:22.850390+01:00] [debug] [pid:67647] no match: 30.8s, best candidate: autoyast-stage1-reboot-upcoming-pvm-20200519 (0.00)
[2024-03-14T18:25:22.851431+01:00] [debug] [pid:67647] considering VNC stalled, no update for 5.25 seconds
[2024-03-14T18:26:31.040729+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
[2024-03-14T18:26:35.104441+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
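
To see how much time passed before the connection was lost, the timestamps in the excerpt above can be compared. This is a small sketch and not part of os-autoinst or openQA: the helper names and the regular expression are assumptions based only on the line layout shown here. It measures the gap between the "VNC stalled" message and the first VNC connection error.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt in this comment.
LOG = """\
[2024-03-14T18:25:22.851431+01:00] [debug] [pid:67647] considering VNC stalled, no update for 5.25 seconds
[2024-03-14T18:26:31.040729+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
"""

# Assumed layout: [ISO timestamp] [level] [pid:N] message
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[pid:\d+\] (?P<msg>.*)$")


def parse(log: str):
    """Yield (timestamp, level, message) for each line matching the assumed layout."""
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield datetime.fromisoformat(m["ts"]), m["level"], m["msg"]


def seconds_until_vnc_error(log: str) -> float:
    """Seconds from the 'VNC stalled' message to the first VNC connect error after it."""
    entries = list(parse(log))
    stalled = next(ts for ts, _, msg in entries if "VNC stalled" in msg)
    error = next(
        ts for ts, _, msg in entries
        if "Error connecting to VNC server" in msg and ts >= stalled
    )
    return (error - stalled).total_seconds()


if __name__ == "__main__":
    print(f"{seconds_until_vnc_error(LOG):.1f} s between stall and first VNC error")
```

For the lines quoted above, the gap is about 68 seconds.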

#18

Updated by lmanfredi about 1 year ago · Edited

Added needle:

with tag package-notification

The needle was removed because it caused failures in other test suites.

#19

Updated by lmanfredi about 1 year ago

Created WIP PR#18943 to add some debug info to the VRs.

It seems that there are two types of issue:

which seem to be related not to the needle mismatch, but to some kind of random network issue.

#20

Updated by lmanfredi about 1 year ago

From the debug info, it seems that only one needle with the tag autoyast-stage1-reboot-upcoming matches:

[2024-03-25T12:11:40.716282+01:00] [debug] [pid:74221] [installation::_debug_needles] $needles is:
  $VAR1 = {
            'error' => '0',
            'needle' => bless( {
                                 'tags' => [
                                             'ENV-ARCH-ppc64le',
                                             'autoyast-stage1-reboot-upcoming'
                                           ],
                                 'name' => 'autoyast-stage1-reboot-upcoming-pvm-20220318',
                                 'properties' => [],
                                 'area' => [
                                             {
                                               'ypos' => 378,
                                               'width' => 106,
                                               'xpos' => 362,
                                               'margin' => 50,
                                               'height' => 18,
                                               'type' => 'match'
                                             }
                                           ],
                                 'file' => 'autoyast-stage1-reboot-upcoming-pvm-20220318.json',
                                 'png' => 'needles/autoyast-stage1-reboot-upcoming-pvm-20220318.png'
                               }, 'needle' ),
            'area' => [
                        {
                          'x' => 386,
                          'w' => 106,
                          'result' => 'ok',
                          'similarity' => '1',
                          'y' => 378,
                          'h' => 18
                        }
                      ],
            'ok' => 1
          };

#21

Updated by JERiveraMoya about 1 year ago

Tags changed from qe-yam-mar-sprint to qe-yam-apr-sprint

#22

Updated by lmanfredi about 1 year ago

After running the VRs again, it looks like we may just have a sporadic network issue.
See the Slack comments. For example:

https://openqa.suse.de/tests/13875587
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm082.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875571
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875550
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875541
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm087.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875539
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm086.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13857739
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
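
One quick way to check whether the incompletes point at a single worker is to count them per host. The sketch below is illustrative only: it uses the six reason strings listed above, and the function name and regular expression are assumptions rather than openQA tooling.

```python
import re
from collections import Counter

# Reason strings copied from the six incomplete jobs listed above.
REASONS = [
    "backend died: Error connecting to VNC server <s390kvm082.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host",
    "backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host",
    "backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host",
    "backend died: Error connecting to VNC server <s390kvm087.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host",
    "backend died: Error connecting to VNC server <s390kvm086.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host",
    "backend died: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host",
]

# Captures the host name between '<' and ':port>'.
HOST_RE = re.compile(r"Error connecting to VNC server <([^:>]+):\d+>")


def failures_by_host(reasons):
    """Count VNC connection failures per SUT host name."""
    counts = Counter()
    for reason in reasons:
        m = HOST_RE.search(reason)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    for host, count in failures_by_host(REASONS).most_common():
        print(f"{host}: {count}")
```

In this sample the six failures are spread over five different hosts, so no single machine stands out.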

#23

Updated by JERiveraMoya about 1 year ago

Could you please paste the latest failures of this problem?
It looks like it has not happened for a long time. We have already investigated enough, so resolving this ticket should be fine.

#24

Updated by lmanfredi about 1 year ago · Edited

From the latest builds:

There is only one incomplete, for build 20240405-1.

#25

Updated by JERiveraMoya about 1 year ago

Status changed from In Progress to Resolved

#26

Updated by lmanfredi about 1 year ago

Closed MR#85
