action #109719

open

[qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"

Added by JRivrain about 2 years ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
2022-04-08
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

See also https://openqa.suse.de/tests/8490202#step/validate_user_login_textmode/2

openQA test in scenario sle-15-SP4-Online-ppc64le-autoyast_mini@ppc64le-hmc-single-disk fails in
validate_partition_table_via_blkid

Test suite description

Test verifies installation with minimal autoyast profile. Same as autoyast_mini_product but with product defined in the profile.

Reproducible

Fails since (at least) Build 119.1

Expected result

Last good: 118.3 (or more recent)

Further details

Always latest result in this scenario: latest


Files

journalctl-p3-xb.txt (3.22 KB) journalctl-p3-xb.txt JRivrain, 2022-04-13 17:04
bash.log.xz (158 KB) bash.log.xz vterm output, as there was no network to upload logs. JRivrain, 2022-04-20 12:15

Related issues 2 (1 open, 1 closed)

Related to qe-yam - action #109986: Investigate failure to connect to hmc, consider adding more waiting time. (Rejected, 2022-04-14)

Is duplicate of openQA Infrastructure - action #109112: Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M (Workable, 2022-03-28)

Actions #1

Updated by JRivrain about 2 years ago

  • Subject changed from Network issues on ppc64le workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
  • Description updated (diff)
Actions #2

Updated by JRivrain about 2 years ago

  • Status changed from New to Rejected
Actions #3

Updated by livdywan about 2 years ago

  • Is duplicate of action #109112: Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M added
Actions #4

Updated by okurz about 2 years ago

  • Subject changed from Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [y][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
  • Status changed from Rejected to New
  • Priority changed from Normal to High
  • Target version set to Ready

Please be aware that #109112 will likely only improve error reporting, not fix the root cause. I wonder about the FQDN. According to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1066 there are these redcurrant-$i.qa.suse.de entries, but racktables only knows a machine redcurrant.arch.suse.de. Are the LPARs of an arch machine really in the QA network?

But also I see https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4 which shows that a system is not fully booted yet. Also in https://openqa.suse.de/tests/8490202#step/first_boot/1 one can see that there is an empty "eth0" entry. So the test should be adapted to really only try to access the machine when it's actually reachable. This has nothing to do with any kind of infrastructure problems.

@JRivrain back to you
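
The suggestion above, to only access the machine once it is actually reachable, could be sketched roughly as follows. This is an illustration of the polling logic only, not os-autoinst code: openQA tests are written in Perl, and the function name `wait_for_ssh`, its parameters and the default timeouts are all assumptions made up for this example.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=300.0, interval=5.0, connect_timeout=3.0):
    """Poll until a TCP connection to host:port succeeds or the deadline passes.

    Hypothetical helper: returns True once the port accepts a connection,
    False if it never did within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the name and opens a TCP connection.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Covers "No route to host", refused connections,
            # name resolution failures and connect timeouts.
            pass
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

A caller would run something like `wait_for_ssh("redcurrant-3.qa.suse.de", timeout=600)` before opening the SSH console and fail with a clear message if it returns False. Note that this only helps if the guest eventually gets an address; it does not address a guest whose network never comes up.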

Actions #5

Updated by okurz about 2 years ago

  • Target version deleted (Ready)
Actions #6

Updated by JRivrain about 2 years ago

  • File bash.log.xz added

okurz wrote:

Please be aware that #109112 will likely only improve error reporting, not fix the root cause. I wonder about the FQDN. According to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1066 there are these redcurrant-$i.qa.suse.de entries, but racktables only knows a machine redcurrant.arch.suse.de. Are the LPARs of an arch machine really in the QA network?

But also I see https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4 which shows that a system is not fully booted yet. Also in https://openqa.suse.de/tests/8490202#step/first_boot/1 one can see that there is an empty "eth0" entry. So the test should be adapted to really only try to access the machine when it's actually reachable. This has nothing to do with any kind of infrastructure problems.

@JRivrain back to you

I am not sure this has nothing to do with infrastructure. eth0 not having an address could be due to a network issue, for example DHCP requests not completing, which prevents the system from starting normally because it never reaches the network target. I don't see how we could change the test code to accommodate that if the source of the problem is a faulty network. Regarding https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4, I agree that this could be something else and I can make a separate report for it, but it could also be due to the system being in degraded mode because services could not start.
This ticket is about the two issues in the description, which clearly indicate a network problem on a fully booted system at the login prompt, in degraded mode, due at least partly to the fact that the wicked service did not start:

Apr 13 11:38:53 install systemd[1]: network.target: Job wicked.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: wickedd-nanny.service: Job wickedd.service/start deleted to break ordering cycle starting with wickedd-nanny.service/start
Apr 13 11:38:53 install systemd[1]: network.target: Job wickedd-dhcp4.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: xvnc.socket: Job YaST2-Firstboot.service/start deleted to break ordering cycle starting with xvnc.socket/start
Apr 13 11:38:55 install wickedd-nanny[861]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Apr 13 11:38:55 install wickedd-nanny[861]: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provided by any .service files
Apr 13 11:38:55 install wickedd-nanny[861]: Couldn't refresh list of active network interfaces
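
To check other runs for the same symptom, the journal excerpt above could be scanned for jobs that systemd dropped to break an ordering cycle. The following is a minimal sketch, assuming the journal text is available with one message per line (not wrapped at the terminal width); the function name `dropped_jobs` and the regular expression are made up for this example and only match the message wording seen in this log.

```python
import re

# Matches the systemd message wording seen in the excerpt, e.g.
# "Job wicked.service/start deleted to break ordering cycle
#  starting with network.target/start"
CYCLE_RE = re.compile(
    r"Job (?P<job>\S+)/start deleted to break ordering cycle "
    r"starting with (?P<origin>\S+)/start"
)


def dropped_jobs(journal_text):
    """Return (deleted_job, cycle_origin) pairs found in the journal text."""
    return [
        (m.group("job"), m.group("origin"))
        for m in CYCLE_RE.finditer(journal_text)
    ]
```

If `wicked.service` or `wickedd.service` shows up among the deleted jobs, the network stack was never started on that boot, which would explain an interface without an address independently of DHCP.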

"ip a" shows the device down. I tried bringing it up and restarting wicked, and also tried restarting it with yast2 lan; all of that fails. It looks like DHCP requests don't complete.

Note that the issue is sporadic.
We can report a bug, but that will be a waste of time if there is an infra issue behind it. WDYT?
Please see the attached logs (bash.log.xz): this is the direct output from the vterm, as I obviously did not have access to the network.

Actions #8

Updated by JRivrain about 2 years ago

  • Related to action #109986: Investigate failure to connect to hmc, consider adding more waiting time. added
Actions #9

Updated by JRivrain about 2 years ago

@okurz, please consider my comments when you have time: sporadically, there is no networking at all in the guest. We need to determine if it's a product bug.

Regarding the other issue you mentioned, we can see here that the system is at the login prompt with an IP address: boot has completed, and yet right after it looks like we cannot log in because the system did not finish booting. In another run where this happened, there is no IP at the login prompt: https://openqa.suse.de/tests/8552922#step/validate_user_login_textmode/4. But despite that, we are able to attempt an ssh connection, so the guest has some networking at that point. This is confusing, but it could happen because the network target is delayed due to a slow or malfunctioning network.
I created a separate ticket for it, as it is a slightly different problem: https://progress.opensuse.org/issues/109986

Actions #10

Updated by JRivrain about 2 years ago

  • File deleted (bash.log.xz)
Actions #11

Updated by JRivrain about 2 years ago

Reformatted the log file without escape characters.
Search for Y2LOG, JOURNALCTL, DMESG to navigate in shell output: Y2LOG is written at the start of /var/log/y2log and so on.

Actions #12

Updated by JRivrain about 2 years ago

  • Subject changed from [y][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"

Removing [y] from the title, as the error "Error connecting to redcurrant-3.qa.suse.de: no route to host" is very likely a network issue that we cannot work around: even manually I could not start the network from the VM.
The other issue is https://bugzilla.suse.com/show_bug.cgi?id=1198294; I'll remove the incorrect tags from openQA jobs.
If we are sure the issue in the description ("no route to host") is not an infra issue but a product bug, then we need to report it. It could also have something to do with https://bugzilla.suse.com/show_bug.cgi?id=1198294.
Please let me know.

Actions #13

Updated by okurz about 2 years ago

  • Subject changed from [ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"

@qe-core could your PPC experts look into this?

Actions #14

Updated by szarate almost 2 years ago

These tickets are not on high prio

Actions #15

Updated by szarate almost 2 years ago

  • Tags set to bulkupdate

These tickets are not on high prio

Actions #16

Updated by szarate almost 2 years ago

  • Priority changed from High to Normal
Actions #17

Updated by slo-gin about 1 month ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
