action #109719
open[qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
Added by JRivrain almost 3 years ago. Updated 10 months ago.
0%
Description
Observation
See also https://openqa.suse.de/tests/8490202#step/validate_user_login_textmode/2
openQA test in scenario sle-15-SP4-Online-ppc64le-autoyast_mini@ppc64le-hmc-single-disk fails in
validate_partition_table_via_blkid
Test suite description
Test verifies installation with minimal autoyast profile. Same as autoyast_mini_product but with product defined in the profile.
Reproducible
Fails since (at least) Build 119.1
Expected result
Last good: 118.3 (or more recent)
Further details
Always latest result in this scenario: latest
Files
journalctl-p3-xb.txt (3.22 KB) | JRivrain, 2022-04-13 17:04
bash.log.xz (158 KB) | vterm output, as there was no network to upload logs. | JRivrain, 2022-04-20 12:15
Updated by JRivrain almost 3 years ago
- Subject changed from Network issues on ppc64le workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
- Description updated (diff)
Updated by JRivrain almost 3 years ago
- Status changed from New to Rejected
Duplicate of https://progress.opensuse.org/issues/109112
Updated by livdywan almost 3 years ago
- Is duplicate of action #109112: Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M added
Updated by okurz almost 3 years ago
- Subject changed from Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [y][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
- Status changed from Rejected to New
- Priority changed from Normal to High
- Target version set to Ready
please be aware that #109112 will likely only handle better error reporting, not fix the root cause. I wonder about the FQDN. According to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1066 there are these redcurrant-$i.qa.suse.de entries but racktables only knows a machine redcurrant.arch.suse.de, are the LPARs of an arch machine really in the QA network?
But also I see https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4 which shows that a system is not fully booted yet. Also in https://openqa.suse.de/tests/8490202#step/first_boot/1 one can see that there is an empty "eth0" entry. So the test should be adapted to really only try to access the machine when it's actually reachable. This has nothing to do with any kind of infrastructure problems.
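The idea of only accessing the machine once it is actually reachable could look roughly like the sketch below. It is a Python illustration only, not code from os-autoinst or the openQA test distribution (which are written in Perl); the function name `wait_until_reachable`, the SSH port 22 and the timeout values are assumptions.

```python
import socket
import time


def wait_until_reachable(host, port=22, timeout=300.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: name, default port and timings are illustrative.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A completed TCP handshake means a route exists and something listens.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers "No route to host", "Connection refused" and timeouts.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would call something like `wait_until_reachable("redcurrant-3.qa.suse.de")` before opening the SSH console and fail with a clear message if it returns False. Note that this only delays the failure: if the guest never gets an address, the check still times out.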
@JRivrain back to you
Updated by JRivrain almost 3 years ago
- File bash.log.xz added
I am not sure this is unrelated to infrastructure. eth0 having no address could be due to a network issue, for example DHCP requests not completing, which prevents the system from starting normally because it never reaches the network target. I don't see how we could change the test code to accommodate that if the source of the problem is a faulty network. Regarding https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4, I agree that this could be something else and I can file a separate report for it, but it could also be due to the system being in degraded mode because services could not start.
This ticket is about the two issues in the description, which clearly indicate a network problem on a fully booted system at the login prompt, in degraded mode, due at least partly to the wicked service not starting:
Apr 13 11:38:53 install systemd[1]: network.target: Job wicked.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: wickedd-nanny.service: Job wickedd.service/start deleted to break ordering cycle starting with wickedd-nanny.service/start
Apr 13 11:38:53 install systemd[1]: network.target: Job wickedd-dhcp4.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: xvnc.socket: Job YaST2-Firstboot.service/start deleted to break ordering cycle starting with xvnc.socket/start
Apr 13 11:38:55 install wickedd-nanny[861]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Apr 13 11:38:55 install wickedd-nanny[861]: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provided by any .service files
Apr 13 11:38:55 install wickedd-nanny[861]: Couldn't refresh list of active network interfaces
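To check other runs for the same symptom, the "deleted to break ordering cycle" messages can be pulled out of a journal dump. The following is a minimal Python sketch, assuming each journal entry sits on one unwrapped line; the function name `dropped_jobs` and the regular expression are illustrative and only match the message wording seen in this excerpt.

```python
import re

# Matches the systemd message wording seen in this ticket's journal excerpt.
CYCLE_RE = re.compile(
    r"Job (?P<job>\S+)/start deleted to break ordering cycle "
    r"starting with (?P<origin>\S+)/start"
)


def dropped_jobs(journal_text):
    """Return (dropped_unit, cycle_origin) pairs found in the journal text."""
    return [
        (m.group("job"), m.group("origin"))
        for m in CYCLE_RE.finditer(journal_text)
    ]
```

If `wicked.service` or `wickedd.service` shows up among the dropped units, the guest booted without its network stack, which would match the "no route to host" error regardless of the state of the surrounding network.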
"ip a" shows the device down. I tried bringing it up and restarting wicked, and also tried restarting it with yast2 lan; all of that fails. It looks like DHCP requests don't complete.
Note that the issue is sporadic.
We can report a bug, but that would be a waste of time if there is an infra issue behind it. WDYT?
Please see the attached logs (bash.log.xz): this is the direct output from the vterm, as I obviously did not have access to the network.
Updated by JRivrain almost 3 years ago
- File journalctl-p3-xb.txt journalctl-p3-xb.txt added
Updated by JRivrain almost 3 years ago
- Related to action #109986: Investigate failure to connect to hmc, consider adding more waiting time. added
Updated by JRivrain almost 3 years ago
@okurz, please consider my comments when you have time: sporadically, there is no networking at all in the guest. We need to determine if it's a product bug.
Regarding the other issue you mentioned, we can see here that the system is at the login prompt with an IP address: boot has completed, and yet right after that it looks like we cannot log in because the system did not finish booting. In another run where this happened, there is no IP at the login prompt: https://openqa.suse.de/tests/8552922#step/validate_user_login_textmode/4. Despite that, we are able to attempt an ssh connection, so the guest has some networking at that point. This is confusing, but could happen because the network target is delayed due to a slow or malfunctioning network.
I created a separate ticket for it, as it is a slightly different problem: https://progress.opensuse.org/issues/109986
Updated by JRivrain over 2 years ago
- File bash.log.xz bash.log.xz added
Reformatted the log file without escape characters.
Search for Y2LOG, JOURNALCTL, DMESG to navigate in shell output: Y2LOG is written at the start of /var/log/y2log and so on.
Updated by JRivrain over 2 years ago
- Subject changed from [y][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
Removing [y] from the title, as the error "Error connecting to redcurrant-3.qa.suse.de: no route to host" is very likely a network issue that we cannot work around: even manually I could not start the network from within the VM.
The other issue is https://bugzilla.suse.com/show_bug.cgi?id=1198294; I'll remove the incorrect tags from the openQA jobs.
If we are sure the issue in the description ("no route to host") is not an infra issue but a product bug, then we need to report it. It could also be related to https://bugzilla.suse.com/show_bug.cgi?id=1198294.
Please let me know.
Updated by okurz over 2 years ago
- Subject changed from [ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
@qe-core could your PPC experts look into this?
Updated by slo-gin 10 months ago
This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.