action #154357
Parent task: coordination #151816: [epic] Handle openQA fixes and job group setup (closed)
[sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x
Added by rainerkoenig 11 months ago. Updated 9 months ago.
Description
Motivation
In SLE Micro Maintenance Updates we sporadically experience incomplete slem_installation_autoyast
tests.
Example: https://openqa.suse.de/tests/13356734
OpenQA gives the following reason:
Reason: backend died: Error connecting to VNC server <s390kvm099.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
Scope:
- SLE Micro YaST & Migration Maintenance Updates job group (ID=535)
- s390x architecture
Acceptance criteria:
- AC1: Root cause is found and the test becomes stable again.
Additional info
According to @JERiveraMoya this is not an infrastructure issue:
"it is related with some installation problem or whatever other problem, and then when we try to reconnect to the sut it is not possible because it is not in good shape to reconnect, so unrelated with infra."
Most likely some timing issues in some point.
Updated by JERiveraMoya 11 months ago
- Tags set to qe-yam-feb-sprint
- Description updated (diff)
- Status changed from New to Workable
- Parent task set to #151816
Updated by leli 10 months ago · Edited
I guess this issue may come from an SSH access problem; please try to allow the ssh service in the firewall and also enable the sshd service.
Example: autoyast/support_images/sles15sp5_install_textmode_default_patterns_s390x.xml
<services t="list">
<service>dhcpv6-client</service>
<service>ssh</service>
<service>tigervnc</service>
<service>tigervnc-https</service>
</services>
<services-manager t="map">
<default_target>multi-user</default_target>
<services t="map">
<disable t="list"/>
<enable t="list">
<service>firewalld</service>
<service>wicked</service>
<service>kdump</service>
<service>kdump-early</service>
<service>systemd-remount-fs</service>
<service>sshd</service>
</enable>
</services>
</services-manager>
Updated by JERiveraMoya 10 months ago
- Subject changed from Investigate sporadic failure on AutoYaST SLE Micro installation on s390x to [sporadic] Investigate failure on AutoYaST SLE Micro installation on s390x
Updated by JERiveraMoya 10 months ago
While we still figure out how to fix it, it would be great to avoid extra work for the reviewers by setting RETRY: 3,
and to remove it once a solution is found.
Updated by JERiveraMoya 10 months ago
That is not a solution we should accept; bumping timeouts is never good practice. You can only bump a timeout a bit, and only with a good reason.
please see my comment in https://gitlab.suse.de/qe-yam/openqa-job-groups/-/merge_requests/85#note_596902
Updated by JERiveraMoya 10 months ago
- Tags changed from qe-yam-feb-sprint to qe-yam-mar-sprint
Updated by JERiveraMoya 10 months ago
We still need to narrow down the issue here: we have narrowed down the product where it happens, but not the automation code that handles it.
Updated by JERiveraMoya 10 months ago
Did you find anything to improve in autoyast/installation.pm (to increase the timeout at some specific point, under some specific condition)?
Perhaps if you point me to the last line executed I can take a closer look.
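A minimal sketch of the kind of targeted change being asked about here, assuming the os-autoinst testapi (check_var, assert_screen); the needle tag is taken from the later comments, while the call site and timeout values are illustrative assumptions rather than the actual code in autoyast/installation.pm:
# Hypothetical: wait longer for the first-stage "reboot upcoming" screen only on
# s390x, where the VNC console has been observed to lag behind the installer.
my $reboot_timeout = check_var('ARCH', 's390x') ? 1200 : 600;    # assumed values, in seconds
assert_screen('autoyast-stage1-reboot-upcoming', $reboot_timeout);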
Updated by JERiveraMoya 10 months ago
lmanfredi wrote in #note-15:
It seems that the latest builds do not show the sporadic failures.
Here are VRs to check the behavior with TIMEOUT_SCALE.
For verification you need 10 or more runs; 4 days ago the error was still present in the history of the job linked in the description of this ticket.
The problem is most likely related to not matching this needle:
https://openqa.suse.de/tests/13774275#step/installation/3
You need to find this point in the code and discuss the possibilities with the squad; see the sketch below. Unfortunately, our AutoYaST logic relies on that needle to perform some actions, AFAIR.
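One rough option to discuss with the squad, sketched under the assumption that the os-autoinst testapi (check_screen, record_info) is available; the retry count and per-attempt timeout are assumptions and do not reflect the current behaviour of autoyast/installation.pm:
# Hypothetical: poll for the needle instead of giving up on the first miss, so a
# late or stalled VNC update does not leave the test blind to the upcoming reboot.
my $matched;
for my $attempt (1 .. 3) {
    $matched = check_screen('autoyast-stage1-reboot-upcoming', 120);
    last if $matched;
    record_info("Attempt $attempt", 'Needle not matched yet, retrying');
}
die 'AutoYaST stage 1 reboot screen never appeared' unless $matched;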
Updated by lmanfredi 10 months ago
Yes, I agree that the problem could be a mismatch of some needles.
At the beginning it seemed that the problem was related to the needle with the tag import-untrusted-gpg-key.
After I excluded that, the problem now happens with the current one, autoyast-stage1-reboot-upcoming:
ERROR - search: out of range 769 768 104 1024
[2024-03-14T18:25:22.850296+01:00] [warn] [pid:67647] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 4.25 seconds for 98 candidate needles - make your needles more specific
[2024-03-14T18:25:22.850390+01:00] [debug] [pid:67647] no match: 30.8s, best candidate: autoyast-stage1-reboot-upcoming-pvm-20200519 (0.00)
[2024-03-14T18:25:22.851431+01:00] [debug] [pid:67647] considering VNC stalled, no update for 5.25 seconds
[2024-03-14T18:26:31.040729+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
[2024-03-14T18:26:35.104441+01:00] [warn] [pid:67647] !!! consoles::VNC::login: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
Updated by lmanfredi 9 months ago
Created WIP PR#18943 to add some debug info to the VRs.
It seems that there are two types of issues:
- Reason: timeout: test execution exceeded MAX_JOB_TIME
- Reason: backend died: Error connecting to VNC server s390kvm091.oqa.prg2.suse.org:5901: IO::Socket::INET: connect: No route to host
These seem unrelated to the needle mismatch and instead point to some kind of random network issue.
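For reference, a minimal sketch of how such debug output could be produced, assuming the os-autoinst testapi and that check_screen returns the match structure shown in the next comment; the helper name mirrors the log below, but the actual change in the WIP PR may look different:
use testapi;
use Data::Dumper;

# Hypothetical helper: dump whatever check_screen returned for the reboot needle,
# so the matched needle, its tags and the similarity show up in the job logs.
sub _debug_needles {
    my ($needles) = @_;
    record_info('_debug_needles', Dumper($needles));
}

_debug_needles(check_screen('autoyast-stage1-reboot-upcoming', 120));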
Updated by lmanfredi 9 months ago
From the debug info, it seems that only one needle with the tag autoyast-stage1-reboot-upcoming matches:
[2024-03-25T12:11:40.716282+01:00] [debug] [pid:74221] [installation::_debug_needles] $needles is:
$VAR1 = {
  'error' => '0',
  'needle' => bless( {
    'tags' => [
      'ENV-ARCH-ppc64le',
      'autoyast-stage1-reboot-upcoming'
    ],
    'name' => 'autoyast-stage1-reboot-upcoming-pvm-20220318',
    'properties' => [],
    'area' => [
      {
        'ypos' => 378,
        'width' => 106,
        'xpos' => 362,
        'margin' => 50,
        'height' => 18,
        'type' => 'match'
      }
    ],
    'file' => 'autoyast-stage1-reboot-upcoming-pvm-20220318.json',
    'png' => 'needles/autoyast-stage1-reboot-upcoming-pvm-20220318.png'
  }, 'needle' ),
  'area' => [
    {
      'x' => 386,
      'w' => 106,
      'result' => 'ok',
      'similarity' => '1',
      'y' => 378,
      'h' => 18
    }
  ],
  'ok' => 1
};
Updated by JERiveraMoya 9 months ago
- Tags changed from qe-yam-mar-sprint to qe-yam-apr-sprint
Updated by lmanfredi 9 months ago
After running the VRs again, it looks like we just have a sporadic network issue here.
See the Slack comments. E.g.:
https://openqa.suse.de/tests/13875587
Result: incomplete
Reason: backend died: Error connecting to VNC server <s390kvm082.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
https://openqa.suse.de/tests/13875571
Result: incomplete
Reason: backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
https://openqa.suse.de/tests/13875550
Result: incomplete
Reason: backend died: Error connecting to VNC server <s390kvm081.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
https://openqa.suse.de/tests/13875541
Result: incomplete
Reason: backend died: Error connecting to VNC server <s390kvm087.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
https://openqa.suse.de/tests/13875539
Result: incomplete
Reason: backend died: Error connecting to VNC server <s390kvm086.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
https://openqa.suse.de/tests/13857739
Result: incomplete
Reason: backend died: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
Updated by JERiveraMoya 9 months ago
Could you please paste the latest failures of this problem?
It looks like it has not happened for a long time, and we have already investigated enough, so resolving this ticket should be fine.
Updated by JERiveraMoya 9 months ago
- Status changed from In Progress to Resolved