action #154357

coordination #151816: [epic] Handle openQA fixes and job group setup

[sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x

Added by rainerkoenig 4 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date: 2024-01-26
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In SLE Micro Maintenance Updates we sporadically experience incomplete slem_installation_autoyast tests.
Example: https://openqa.suse.de/tests/13356734

openQA reports the following reason:

Reason: backend died: Error connecting to VNC server <s390kvm099.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host

Scope:

  • SLE Micro YaST & Migration Maintenance Updates job group (ID=535)
  • s390x architecture

Acceptance criteria:

  • AC1: The root cause is found and the tests become stable again.

Additional info

According to @JERiveraMoya this is not an infrastructure issue:
"it is related with some installation problem or whatever other problem, and then when we try to reconnect to the sut it is not possible because it is not in good shape to reconnect, so unrelated with infra."
Most likely there is a timing issue at some point in the test.

Actions #1

Updated by JERiveraMoya 4 months ago

  • Tags set to qe-yam-feb-sprint
  • Description updated (diff)
  • Status changed from New to Workable
  • Parent task set to #151816
Actions #2

Updated by lmanfredi 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to lmanfredi
Actions #3

Updated by lmanfredi 3 months ago

The issue no longer seems to occur in the latest builds.

Actions #4

Updated by lmanfredi 3 months ago

Over 100 runs, the issue is still there, at a rate of 14%. See VRs.

Actions #5

Updated by lmanfredi 3 months ago

With a larger TIMEOUT_SCALE the issue seems to disappear. See VRs.

Actions #6

Updated by lmanfredi 3 months ago

Created openqa-job-groups MR#85

Actions #7

Updated by leli 3 months ago · Edited

I guess this issue may come from an SSH access problem. Please try allowing ssh in the firewall and also enabling the sshd service.

Example: autoyast/support_images/sles15sp5_install_textmode_default_patterns_s390x.xml

<services t="list">
  <service>dhcpv6-client</service>
  <service>ssh</service>
  <service>tigervnc</service>
  <service>tigervnc-https</service>
</services>

<services-manager t="map">
  <default_target>multi-user</default_target>
  <services t="map">
    <disable t="list"/>
    <enable t="list">
      <service>firewalld</service>
      <service>wicked</service>
      <service>kdump</service>
      <service>kdump-early</service>
      <service>systemd-remount-fs</service>
      <service>sshd</service>
    </enable>
  </services>
</services-manager>
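
Note: a minimal Python sketch of how a profile could be checked for the two settings suggested above (ssh allowed in the firewall, sshd enabled). It is not part of the test code. The <profile> wrapper, the PROFILE sample and the ssh_ready name are hypothetical, and the fragment is treated as namespace-free, as quoted here; a profile that declares a default XML namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around a shortened copy of the two fragments quoted above.
PROFILE = """
<profile>
  <services t="list">
    <service>dhcpv6-client</service>
    <service>ssh</service>
  </services>
  <services-manager t="map">
    <services t="map">
      <disable t="list"/>
      <enable t="list">
        <service>firewalld</service>
        <service>sshd</service>
      </enable>
    </services>
  </services-manager>
</profile>
"""


def ssh_ready(profile_xml):
    """Report whether ssh is listed as a firewall service and sshd is enabled."""
    root = ET.fromstring(profile_xml)
    # Firewall side: a <service>ssh</service> inside any <services t="list"> block.
    firewall_ssh = any(
        (svc.text or "").strip() == "ssh"
        for services in root.iter("services")
        if services.get("t") == "list"
        for svc in services.findall("service")
    )
    # Service side: a <service>sshd</service> inside any <enable> list.
    sshd_enabled = any(
        (svc.text or "").strip() == "sshd"
        for enable in root.iter("enable")
        for svc in enable.findall("service")
    )
    return {"firewall_ssh": firewall_ssh, "sshd_enabled": sshd_enabled}


print(ssh_ready(PROFILE))
```

A profile that lacks either entry would report False for that key, which makes it easy to spot before scheduling a run.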

Actions #8

Updated by JERiveraMoya 3 months ago

  • Subject changed from Investigate sporadic failure on AutoYaST SLE Micro installation on s390x to [sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x
Actions #9

Updated by JERiveraMoya 3 months ago

While we are still figuring out how to fix it, it would be good to set RETRY: 3 to spare the reviewer extra work, and to remove it once a solution is found.

Actions #10

Updated by lmanfredi 3 months ago · Edited

Added RETRY: 3 and increased TIMEOUT_SCALE to 15. Now v5.4 works fine.
See VRs
See MR#85
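
Note: a rough Python sketch of why a retry setting hides a sporadic failure without fixing it. It assumes that RETRY: 3 allows up to 3 additional attempts, that each attempt fails independently, and that the 14% rate measured over 100 runs in this ticket applies; the function name is hypothetical.

```python
def chance_all_attempts_fail(failure_rate, retries):
    """Probability that the first run and every retry fail (independence assumed)."""
    attempts = retries + 1  # the original run plus the retries
    return failure_rate ** attempts


# 14% is the rate reported in this ticket; 3 is the RETRY value that was set.
print(f"{chance_all_attempts_fail(0.14, 3):.6f}")
```

Under these assumptions a reviewer would almost never see the incomplete job, although the underlying failure rate per run stays the same.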

Actions #11

Updated by JERiveraMoya 3 months ago

That is not a solution we should accept. Bumping the timeout is never good practice; you can only bump it a little, and only with a good reason.
Please see my comment in https://gitlab.suse.de/qe-yam/openqa-job-groups/-/merge_requests/85#note_596902

Actions #12

Updated by JERiveraMoya 3 months ago

  • Tags changed from qe-yam-feb-sprint to qe-yam-mar-sprint
Actions #13

Updated by JERiveraMoya 2 months ago

We still need to narrow down the issue here: we have narrowed down the product where it happens, but not the automation code that handles it.

Actions #14

Updated by JERiveraMoya 2 months ago

Did you find anything to improve in autoyast/installation.pm (for example, increasing the timeout at a specific point under a specific condition)?
If you point me to the last line executed, I can take a closer look.

Actions #15

Updated by lmanfredi 2 months ago · Edited

The latest builds do not seem to show the sporadic failures.

Here are VRs to check the behavior with TIMEOUT_SCALE.

Actions #16

Updated by JERiveraMoya 2 months ago

lmanfredi wrote in #note-15:

It seems that in the latest builds does not show the sporadic failures:

Here a VRs to check the behavior with TIMEOUT_SCALE.

For verification you need 10 runs or more; the error was still present 4 days ago in the history of the job linked in the description of this ticket.
The problem is most likely related to this needle not matching:
https://openqa.suse.de/tests/13774275#step/installation/3
You need to find this point in the code and discuss the possibilities with the squad. Unfortunately, our AutoYaST logic relies on that needle to perform some actions, AFAIR.
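
Note: a small Python sketch of why a handful of green verification runs says little about a sporadic failure. It assumes the 14% rate measured over 100 runs earlier in this ticket and independent runs; the function names are hypothetical.

```python
import math


def chance_of_all_clean(failure_rate, runs):
    """Probability that every one of `runs` independent runs passes."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, max_chance):
    """Smallest number of runs for which an all-clean result is rarer than max_chance."""
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))


print(round(chance_of_all_clean(0.14, 10), 3))  # chance that 10 runs are all green
print(runs_needed(0.14, 0.05))                  # runs needed to push that below 5%
```

Under these assumptions, 10 green runs would still happen by chance roughly one time in five, and about 20 runs are needed before an all-green result becomes unlikely.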

Actions #17

Updated by lmanfredi 2 months ago

Yes, I agree that the problem could be a mismatch on some needles.
At first it seemed that the problem was related to the needle with the tag import-untrusted-gpg-key.
After I excluded that one, the problem occurs with the current needle, autoyast-stage1-reboot-upcoming-pvm:

ERROR - search: out of range 769 768 104 1024
[2024-03-14T18:25:22.850296+01:00] [warn] [pid:67647] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 4.25 seconds for 98 candidate needles - make your needles more specific
[2024-03-14T18:25:22.850390+01:00] [debug] [pid:67647] no match: 30.8s, best candidate: autoyast-stage1-reboot-upcoming-pvm-20200519 (0.00)
[2024-03-14T18:25:22.851431+01:00] [debug] [pid:67647] considering VNC stalled, no update for 5.25 seconds
[2024-03-14T18:26:31.040729+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
[2024-03-14T18:26:35.104441+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
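
Note: a minimal Python sketch for pulling the two relevant facts out of a log excerpt like the one above: the needle that failed to match and the VNC connection errors that followed. The regular expressions are written against the lines quoted in this comment only and may need adjusting for other log formats; the names are hypothetical.

```python
import re

# Two lines copied from the excerpt above.
SAMPLE = """\
[2024-03-14T18:25:22.850390+01:00] [debug] [pid:67647] no match: 30.8s, best candidate: autoyast-stage1-reboot-upcoming-pvm-20200519 (0.00)
[2024-03-14T18:26:31.040729+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
"""

VNC_ERROR = re.compile(
    r"Error connecting to VNC server <(?P<host>[^:>]+):(?P<port>\d+)>: (?P<reason>.+)$"
)
NO_MATCH = re.compile(
    r"no match: (?P<seconds>[\d.]+)s, best candidate: (?P<needle>\S+) \((?P<score>[\d.]+)\)"
)


def summarize(log_text):
    """Collect VNC connection errors and needle misses from a log excerpt."""
    vnc_errors, needle_misses = [], []
    for line in log_text.splitlines():
        m = VNC_ERROR.search(line)
        if m:
            vnc_errors.append((m["host"], int(m["port"]), m["reason"].strip()))
            continue
        m = NO_MATCH.search(line)
        if m:
            needle_misses.append((m["needle"], float(m["seconds"]), float(m["score"])))
    return {"vnc_errors": vnc_errors, "needle_misses": needle_misses}


print(summarize(SAMPLE))
```

Seeing a needle miss immediately before the VNC errors, as in this excerpt, is what suggests the two may be related rather than independent.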

Actions #18

Updated by lmanfredi 2 months ago · Edited

Added a needle with the tag package-notification.

The needle was removed because it caused failures in other test suites.

Actions #19

Updated by lmanfredi about 2 months ago

Created WIP PR#18943 to add some debug info in VRs.

It seems that there are two types of issue:

that seem unrelated to the needle mismatch and instead point to some kind of random network issue.

Actions #20

Updated by lmanfredi about 2 months ago

From the debug info, it seems that only one needle with the tag autoyast-stage1-reboot-upcoming matches:

[2024-03-25T12:11:40.716282+01:00] [debug] [pid:74221] [installation::_debug_needles] $needles is:
  $VAR1 = {
            'error' => '0',
            'needle' => bless( {
                                 'tags' => [
                                             'ENV-ARCH-ppc64le',
                                             'autoyast-stage1-reboot-upcoming'
                                           ],
                                 'name' => 'autoyast-stage1-reboot-upcoming-pvm-20220318',
                                 'properties' => [],
                                 'area' => [
                                             {
                                               'ypos' => 378,
                                               'width' => 106,
                                               'xpos' => 362,
                                               'margin' => 50,
                                               'height' => 18,
                                               'type' => 'match'
                                             }
                                           ],
                                 'file' => 'autoyast-stage1-reboot-upcoming-pvm-20220318.json',
                                 'png' => 'needles/autoyast-stage1-reboot-upcoming-pvm-20220318.png'
                               }, 'needle' ),
            'area' => [
                        {
                          'x' => 386,
                          'w' => 106,
                          'result' => 'ok',
                          'similarity' => '1',
                          'y' => 378,
                          'h' => 18
                        }
                      ],
            'ok' => 1
          };
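
Note: a short Python sketch relating the two 'area' entries in the dump above. It assumes that 'margin' is the number of pixels by which a match may be offset from the position stored in the needle; the function name is hypothetical and the values are copied from the dump.

```python
def match_offset(needle_area, matched_area):
    """Return the (dx, dy) offset of the match and whether it lies within the margin."""
    dx = matched_area["x"] - needle_area["xpos"]
    dy = matched_area["y"] - needle_area["ypos"]
    margin = needle_area["margin"]
    within = abs(dx) <= margin and abs(dy) <= margin
    return (dx, dy), within


# Values copied from the dump above.
needle_area = {"xpos": 362, "ypos": 378, "width": 106, "height": 18, "margin": 50}
matched_area = {"x": 386, "y": 378, "w": 106, "h": 18}

print(match_offset(needle_area, matched_area))
```

Under that assumption the match sits 24 pixels to the right of the stored position, inside the 50 pixel margin, which is consistent with the 'ok' result and similarity of 1 in the dump.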
Actions #21

Updated by JERiveraMoya about 2 months ago

  • Tags changed from qe-yam-mar-sprint to qe-yam-apr-sprint
Actions #22

Updated by lmanfredi about 2 months ago

After running the VRs again, this may be just a sporadic network issue.
See the Slack comments. For example:

https://openqa.suse.de/tests/13875587
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm082.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875571
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875550
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875541
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm087.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13875539
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm086.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 

https://openqa.suse.de/tests/13857739
Result: incomplete      
Reason: backend died: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host 
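
Note: a small Python sketch that tallies the incompletes listed above per s390x host, to see whether the failures cluster on one machine. The host names are copied from the reasons in this comment; the helper name is hypothetical.

```python
import re
from collections import Counter

# Hosts from the six incomplete jobs listed above, in the same order.
HOSTS = ["s390kvm082", "s390kvm081", "s390kvm081", "s390kvm087", "s390kvm086", "s390kvm097"]
REASONS = [
    f"backend died: Error connecting to VNC server <{host}.oqa.prg2.suse.org:5901>: "
    "IO::Socket::INET: connect: No route to host"
    for host in HOSTS
]

HOST_RE = re.compile(r"VNC server <([^:>]+):\d+>")


def failures_per_host(reasons):
    """Count how often each VNC host appears in a list of failure reasons."""
    counts = Counter()
    for reason in reasons:
        m = HOST_RE.search(reason)
        if m:
            counts[m.group(1)] += 1
    return counts


for host, count in failures_per_host(REASONS).most_common():
    print(host, count)
```

The six incompletes are spread over five different hosts, so they do not point at a single faulty machine.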


Actions #23

Updated by JERiveraMoya about 1 month ago

Could you please paste the latest failures for this problem?
It looks like it has not happened for a long time and we have already investigated enough, so resolving this ticket should be fine.

Actions #24

Updated by lmanfredi about 1 month ago · Edited

From latest builds:

there is only one incomplete for build 20240405-1

Actions #25

Updated by JERiveraMoya about 1 month ago

  • Status changed from In Progress to Resolved
Actions #26

Updated by lmanfredi about 1 month ago

Closed MR#85
