action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" - openQA Tests (public) - openSUSE Project Management Tool

Added by JRivrain about 3 years ago. Updated 2 months ago.

Status: New
Priority: Low
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2022-04-08
Due date:
% Done:
Estimated time:
Difficulty:
Tags: bulkupdate

Description

Observation

See also https://openqa.suse.de/tests/8490202#step/validate_user_login_textmode/2

openQA test in scenario sle-15-SP4-Online-ppc64le-autoyast_mini@ppc64le-hmc-single-disk fails in
validate_partition_table_via_blkid

Test suite description

Test verifies installation with minimal autoyast profile. Same as autoyast_mini_product but with product defined in the profile.

Reproducible

Fails since (at least) Build 119.1

Expected result

Last good: 118.3 (or more recent)

Further details

Always latest result in this scenario: latest

Files

journalctl-p3-xb.txt (3.22 KB), JRivrain, 2022-04-13 17:04
bash.log.xz (158 KB): vterm output, as there was no network to upload logs. JRivrain, 2022-04-20 12:15

Related issues 2 (1 open — 1 closed)

Updated by JRivrain about 3 years ago

Subject changed from Network issues on ppc64le workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
Description updated (diff)

Updated by JRivrain about 3 years ago

Status changed from New to Rejected

Duplicate of https://progress.opensuse.org/issues/109112

Updated by livdywan about 3 years ago

Is duplicate of action #109112: Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M added

Updated by okurz about 3 years ago

Subject changed from Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [y][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"
Status changed from Rejected to New
Priority changed from Normal to High
Target version set to Ready

Please be aware that #109112 will likely only improve error reporting, not fix the root cause. I wonder about the FQDN: according to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1066 there are these redcurrant-$i.qa.suse.de entries, but racktables only knows a machine named redcurrant.arch.suse.de. Are the LPARs of an arch machine really in the QA network?

But I also see https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4, which shows a system that is not fully booted yet. Also, in https://openqa.suse.de/tests/8490202#step/first_boot/1 one can see that there is an empty "eth0" entry. So the test should be adapted to only try to access the machine when it is actually reachable. This has nothing to do with any kind of infrastructure problem.

@JRivrain back to you
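
To illustrate the suggestion of only accessing the machine once it is reachable, here is a minimal sketch of a reachability wait in Python. It is not the os-autoinst implementation (openQA tests are written in Perl); the function name `wait_for_port`, the default timeout and interval, and the injectable `connect` parameter are assumptions made for the example.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0,
                  connect=socket.create_connection):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    `connect` is injectable so the retry logic can be exercised without a network.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # "No route to host" surfaces here as an OSError (EHOSTUNREACH).
            with connect((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Possible usage before opening the SSH console (hostname taken from this ticket):
#   if not wait_for_port("redcurrant-3.qa.suse.de"):
#       raise RuntimeError("SUT never became reachable over SSH")
```

Note that such a wait only helps when the guest eventually gets an address; it would not mask a guest whose network never comes up, it would just fail later with a clearer message.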

Updated by okurz about 3 years ago

Target version deleted (~~Ready~~)

Updated by JRivrain about 3 years ago

File bash.log.xz added

okurz wrote:

Please be aware that #109112 will likely only improve error reporting, not fix the root cause. I wonder about the FQDN: according to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1066 there are these redcurrant-$i.qa.suse.de entries, but racktables only knows a machine named redcurrant.arch.suse.de. Are the LPARs of an arch machine really in the QA network?

But I also see https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4, which shows a system that is not fully booted yet. Also, in https://openqa.suse.de/tests/8490202#step/first_boot/1 one can see that there is an empty "eth0" entry. So the test should be adapted to only try to access the machine when it is actually reachable. This has nothing to do with any kind of infrastructure problem.

@JRivrain back to you

I am not sure this has nothing to do with infrastructure. eth0 not having an address could be due to a network issue, for example DHCP requests not completing, which prevents the system from starting normally because it never reaches the network target. I don't see how we could change the test code to accommodate that if the source of the problem is a faulty network. Regarding https://openqa.suse.de/tests/8492975#step/validate_user_login_textmode/4, I agree that this could be something else, and I can file a separate report for it, but it could also be due to the system being in degraded mode because services could not start.
This ticket is about the two issues in the description, which clearly indicate a network problem on a fully booted system at the login prompt, in degraded mode, due at least partly to the fact that the wicked service did not start:

Apr 13 11:38:53 install systemd[1]: network.target: Job wicked.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: wickedd-nanny.service: Job wickedd.service/start deleted to break ordering cycle starting with wickedd-nanny.service/start
Apr 13 11:38:53 install systemd[1]: network.target: Job wickedd-dhcp4.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: xvnc.socket: Job YaST2-Firstboot.service/start deleted to break ordering cycle starting with xvnc.socket/start
Apr 13 11:38:55 install wickedd-nanny[861]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Apr 13 11:38:55 install wickedd-nanny[861]: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provided by any .service files
Apr 13 11:38:55 install wickedd-nanny[861]: Couldn't refresh list of active network interfaces
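
For anyone triaging similar runs, the log excerpt above can be scanned for the start jobs that systemd dropped. The following is a rough sketch, assuming the message wording seen in this ticket; the names `CYCLE_RE` and `dropped_jobs` are made up for the example, and the sample only repeats two lines from the excerpt.

```python
import re

# Matches the "deleted to break ordering cycle" messages as worded in this ticket.
CYCLE_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Job (?P<job>\S+)/start deleted "
    r"to break ordering cycle starting with (?P<origin>\S+)/start"
)


def dropped_jobs(journal_text):
    """Return (dropped unit, unit where the cycle started) pairs found in the text."""
    return [(m.group("job"), m.group("origin"))
            for m in CYCLE_RE.finditer(journal_text)]


SAMPLE = """\
Apr 13 11:38:53 install systemd[1]: network.target: Job wicked.service/start deleted to break ordering cycle starting with network.target/start
Apr 13 11:38:53 install systemd[1]: xvnc.socket: Job YaST2-Firstboot.service/start deleted to break ordering cycle starting with xvnc.socket/start
"""

for job, origin in dropped_jobs(SAMPLE):
    print(f"{job} was dropped (cycle starts at {origin})")
```

If wicked.service shows up in that list, the guest never attempted to start its network stack during that boot, regardless of what the external network was doing.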

"ip a" shows the device down. I tried bringing it up and restarting wicked, and also tried restarting it with yast2 lan; everything fails. It looks like DHCP requests don't complete.

Note that the issue is sporadic.
We can report a bug, but that would be a waste of time if there is an infra issue behind it. WDYT?
Please see the attached logs (bash.log.xz): this is the direct output from the vterm, as I obviously did not have network access.
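
The manual "ip a" check described above could also be scripted from the serial/vterm console, so that a run reports "no address on eth0" instead of a later SSH error. Below is a minimal sketch that parses the JSON output of `ip -j addr show`; the function name `interfaces_without_ipv4` is an assumption, and the sample data is hypothetical and trimmed, not taken from the failing machine.

```python
import json


def interfaces_without_ipv4(ip_json):
    """Return names of non-loopback interfaces that have no IPv4 address."""
    missing = []
    for link in json.loads(ip_json):
        if "LOOPBACK" in link.get("flags", []):
            continue
        has_v4 = any(addr.get("family") == "inet"
                     for addr in link.get("addr_info", []))
        if not has_v4:
            missing.append(link["ifname"])
    return missing


# Hypothetical, trimmed `ip -j addr show` output for a guest whose eth0 is down.
SAMPLE = json.dumps([
    {"ifname": "lo", "flags": ["LOOPBACK", "UP", "LOWER_UP"], "operstate": "UNKNOWN",
     "addr_info": [{"family": "inet", "local": "127.0.0.1", "prefixlen": 8}]},
    {"ifname": "eth0", "flags": ["BROADCAST", "MULTICAST"], "operstate": "DOWN",
     "addr_info": []},
])

print(interfaces_without_ipv4(SAMPLE))  # prints ['eth0']
```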

Updated by JRivrain about 3 years ago

File journalctl-p3-xb.txt journalctl-p3-xb.txt added

Updated by JRivrain about 3 years ago

Related to action #109986: Investigate failure to connect to hmc, consider adding more waiting time. added

Updated by JRivrain about 3 years ago

@okurz, please consider my comments when you have time: sporadically, there is no networking at all in the guest. We need to determine whether it's a product bug.

Regarding the other issue you mentioned: we can see here that the system is at the login prompt with an IP address, so boot has completed, and yet right after that it looks like we cannot log in because the system did not finish booting. In another run where this happened, there is no IP at the login prompt: https://openqa.suse.de/tests/8552922#step/validate_user_login_textmode/4. Despite that, we are able to attempt an ssh connection, so the guest has some networking at that point. This is confusing, but it could happen because the network target is delayed due to a slow or malfunctioning network.
I created a different ticket for it, as it is a slightly different problem: https://progress.opensuse.org/issues/109986

#10

Updated by JRivrain about 3 years ago

File deleted (~~bash.log.xz~~)

#11

Updated by JRivrain about 3 years ago

File bash.log.xz bash.log.xz added

Reformatted the log file without escape characters.
Search for Y2LOG, JOURNALCTL, DMESG to navigate in shell output: Y2LOG is written at the start of /var/log/y2log and so on.
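
To navigate the attached log programmatically rather than by searching, something like the following sketch could split it into sections. It assumes each marker (Y2LOG, JOURNALCTL, DMESG) sits alone on its own line, which the comment above does not state explicitly; the function name `split_sections` is hypothetical.

```python
MARKERS = ("Y2LOG", "JOURNALCTL", "DMESG")


def split_sections(text, markers=MARKERS):
    """Map each marker to the lines that follow it, up to the next marker."""
    sections = {}
    current = None
    for line in text.splitlines():
        name = line.strip()
        if name in markers:
            # Assumption: a marker is a line containing only the marker word.
            current = name
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Possible usage with the attachment (requires the file locally):
#   import lzma
#   with lzma.open("bash.log.xz", "rt", errors="replace") as fh:
#       journal_lines = split_sections(fh.read()).get("JOURNALCTL", [])
```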

#12

Updated by JRivrain about 3 years ago

Subject changed from [y][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"

Removing [y] from the title, as the error "Error connecting to redcurrant-3.qa.suse.de: no route to host" is very likely a network issue that we cannot work around: even manually I could not start the network from the VM.
The other issue is https://bugzilla.suse.com/show_bug.cgi?id=1198294; I'll remove the incorrect tags from openQA jobs.
If we are sure the issue in the description ("no route to host") is not an infra issue but a product bug, then we need to report it. It may also have something to do with https://bugzilla.suse.com/show_bug.cgi?id=1198294.
Please let me know.

#13

Updated by okurz about 3 years ago

Subject changed from [ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" to [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host"

@qe-core could your PPC experts look into this?

#14

Updated by szarate almost 3 years ago

These tickets are not on high prio

#15

Updated by szarate almost 3 years ago

Tags set to bulkupdate

These tickets are not on high prio

#16

Updated by szarate almost 3 years ago

Priority changed from High to Normal

#17

Updated by slo-gin about 1 year ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

#18

Updated by slo-gin 2 months ago

Priority changed from Normal to Low

This ticket was set to Normal priority but was not updated within the SLO period. The ticket will be set to the next lower priority Low.
