action #134282
Updated by livdywan about 1 year ago
## Observation
Multiple iSCSI tests are failing on the multi-machine setup.
So far, almost all of them fail at the "iscsi_client" step, for example:
- 12SP5: https://openqa.suse.de/tests/11821503
- 15SP1: https://openqa.suse.de/tests/11822477
- 15SP2: https://openqa.suse.de/tests/11827371
- 15SP3: https://openqa.suse.de/tests/11821798
- 15SP4: https://openqa.suse.de/tests/11820612
- 15SP5: https://openqa.suse.de/tests/11821882
So far, I have been unable to pinpoint an update that could be the root cause, since the issue occurs on all supported SLES versions.
According to the serial0.txt log from one test node, the node seems to have lost communication with the iSCSI server:
[ 445.225255][ T3182] sd 3:0:0:0: [sda] Optimal transfer size 42949672 logical blocks > dev_max (65535 logical blocks)
[ 455.449746][ C3] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295003573, last ping 4295004824, now 4295006080
[ 455.452820][ C3] connection1:0: detected conn error (1022)
[ 455.281284] iscsid[9644]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
[ 458.309464] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)
[ 458.513865][ T9694] sd 3:0:0:2: Attached scsi generic sg2 type 0
[ 468.761789][ C3] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295006845, last ping 4295008128, now 4295009408
[ 468.765772][ C3] connection1:0: detected conn error (1022)
[ 468.594268] iscsid[9644]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
[ 471.621874] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)
I'm struggling to get a debug run going, since OSD seems to be overloaded at the moment. The last time I tried to debug, I was able to ping and communicate with the "support server" normally, but the issue was not happening very often at that time.
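
To compare nodes without a live debug session, the timeout/recovery pattern in the serial log above could be summarized offline. The following is only a minimal sketch: the function name `summarize_iscsi_log` and the regular expressions are assumptions derived from the few log lines quoted in this ticket, not from any openQA tooling, and may need adjusting for other kernel or iscsid versions.

```python
import re

# Kernel line, e.g.:
# [ 455.449746][ C3] connection1:0: ping timeout of 5 secs expired, recv timeout 5, ...
TIMEOUT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\[\s*\w+\]\s+(?P<conn>connection\d+:\d+): "
    r"ping timeout of (?P<secs>\d+) secs expired"
)

# iscsid line, e.g.:
# [ 458.309464] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)
RECOVERY_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+iscsid\[\d+\]: iscsid: "
    r"(?P<conn>connection\d+:\d+) is operational after recovery"
)


def summarize_iscsi_log(text):
    """Collect NOP ping timeouts and recoveries from a serial log (hypothetical helper)."""
    timeouts, recoveries = [], []
    for line in text.splitlines():
        line = line.strip()
        match = TIMEOUT_RE.match(line)
        if match:
            timeouts.append(float(match.group("ts")))
            continue
        match = RECOVERY_RE.match(line)
        if match:
            recoveries.append(float(match.group("ts")))
    # Only kernel timestamps are compared with each other here: in the quoted
    # log the iscsid timestamps appear slightly skewed against the kernel ones.
    gaps = [round(later - earlier, 3) for earlier, later in zip(timeouts, timeouts[1:])]
    return {
        "timeouts": timeouts,
        "recoveries": recoveries,
        "gaps_between_timeouts": gaps,
    }


# Two timeout/recovery cycles copied from the log in this ticket.
SAMPLE = """\
[ 455.449746][ C3] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295003573, last ping 4295004824, now 4295006080
[ 458.309464] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)
[ 468.761789][ C3] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295006845, last ping 4295008128, now 4295009408
[ 471.621874] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)
"""

print(summarize_iscsi_log(SAMPLE))
```

Running the same function over the full contents of serial0.txt from several jobs would show whether the connection drops at a regular interval on every affected worker, which might help when filing the network ticket.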
## Acceptance criteria
* **AC1:** Multi-machine tests work with different physical hosts
## Suggestions
- File SD-INFRA ticket for network issue
- Confirm how #111908 is related
## Work-arounds
- Adjust job groups to pin specific workers by hard-coding relevant worker classes
- TODO: Identify what has been implemented and will need to be undone
- Adjust salt to remove workers not known to work
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596/diffs
- No consensus and hence not merged yet