action #134282

Updated by livdywan about 1 year ago

## Observations 
- Multi-machine jobs can't download artifacts from OBS/pip

There are multiple failures in the iscsi tests run on the multi-machine setup. So far, almost all tests are failing on the "iscsi_client" step, for example:

- 12SP5: https://openqa.suse.de/tests/11821503
- 15SP1: https://openqa.suse.de/tests/11822477
- 15SP2: https://openqa.suse.de/tests/11827371
- 15SP3: https://openqa.suse.de/tests/11821798
- 15SP4: https://openqa.suse.de/tests/11820612
- 15SP5: https://openqa.suse.de/tests/11821882

So far, I was unable to pinpoint an update that could be the root cause of this issue, since it is happening on all supported SLES versions.

From the serial0.txt log from one node, it seems that it somehow lost communication with the iscsi server:

 [    445.225255][ T3182] sd 3:0:0:0: [sda] Optimal transfer size 42949672 logical blocks > dev_max (65535 logical blocks)
 [    455.449746][      C3]    connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295003573, last ping 4295004824, now 4295006080
 [    455.452820][      C3]    connection1:0: detected conn error (1022)
 [    455.281284] iscsid[9644]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
 [    458.309464] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)
 [    458.513865][ T9694] sd 3:0:0:2: Attached scsi generic sg2 type 0
 [    468.761789][      C3]    connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295006845, last ping 4295008128, now 4295009408
 [    468.765772][      C3]    connection1:0: detected conn error (1022)
 [    468.594268] iscsid[9644]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
 [    471.621874] iscsid[9644]: iscsid: connection1:0 is operational after recovery (1 attempts)

I'm struggling to get a debug mode to run, since it seems that osd is overloaded at this moment. The last time I tried to debug, I was able to ping and communicate with the "support server" normally (but the issue was not happening very often at that time).

## Problem
* **H1** *REJECT* The product has changed
  * -> **E1-1** Compare tests on multiple product versions -> **O1-1-1** We observed the problem in multiple products with different states of maintenance updates, and the support server is an old SLE12SP3 with no change in maintenance updates for months. It is unlikely that the iscsi client changed recently, but that has to be verified
  * -> **E1-2** Find output of "openqa-investigate" jobs comparing against "last good" -> **O1-2-1** https://openqa.suse.de/tests/12080239#comment-993398 shows reproducibly four failed tests, so the failure is reproducible for all states of test and product, so reject *H1*
* **H2** Fails because of changes in test setup
  * **H2.1** Our test hardware equipment behaves differently
  * **H2.2** The network behaves differently
* **H3** Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
  * -> **E3-1** TODO Compare package versions installed on machines from "last good" with "first bad", e.g. from /var/log/zypp/history
* **H4** Fails because of changes in test management configuration, e.g. openQA database settings
  * -> wait for E5-1
* **H5** Fails because of changes in the test software itself (the test plan in source code as well as needles)
  * -> **E5-1** TODO Compare vars.json from "last good" with "first bad" and in particular look into changes to needles and job templates
* **H6** *REJECT* Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
  * -> **O6-1** https://progress.opensuse.org/issues/134282#note-71 but there is no 100% fail ratio
  * -> **E6-2** Increase timeout in the initial step of firewall configuration to check if we have non-reliable test results due to timeouts
  * -> TODO Investigate the timeout in the initial step of firewall configuration
  * -> TODO Add TIMEOUT_SCALE=3 on non HanasR cluster tests' support servers

* **H7** Multi-machine jobs don't work across workers -> **E7-1** Run multi-machines only on a single physical machine -> **O7-1-1** TBD
  * We *could* pin jobs to a worker but that will need to be implemented properly, see #135035
  * We otherwise need to understand the infra setup better

## Acceptance criteria
* **AC1:** Multi-machine tests work with different physical hosts
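
The ping-timeout pattern visible in the serial0.txt excerpt quoted in this ticket can be quantified with a small script, which may help when comparing "last good" and "first bad" runs. The following is only a minimal Python sketch and not part of any openQA tooling: the names `ping_timeouts` and `gaps` are hypothetical, and the regular expression assumes the kernel log line format seen in the excerpt.

```python
import re

# Lines copied from the serial0.txt excerpt quoted in this ticket.
SAMPLE = """\
[    455.449746][      C3]    connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295003573, last ping 4295004824, now 4295006080
[    455.452820][      C3]    connection1:0: detected conn error (1022)
[    468.761789][      C3]    connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295006845, last ping 4295008128, now 4295009408
"""

# Assumed format: "[ <seconds> ][ <cpu/task> ] connectionN:M: ping timeout of X secs expired"
PING_TIMEOUT = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\].*?(?P<conn>connection\d+:\d+): "
    r"ping timeout of (?P<secs>\d+) secs expired"
)


def ping_timeouts(log_text):
    """Return (timestamp, connection) tuples for every iSCSI ping timeout line."""
    events = []
    for line in log_text.splitlines():
        match = PING_TIMEOUT.match(line.strip())
        if match:
            events.append((float(match.group("ts")), match.group("conn")))
    return events


def gaps(events):
    """Return the seconds elapsed between consecutive ping timeouts."""
    stamps = [ts for ts, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    found = ping_timeouts(SAMPLE)
    print(len(found), "ping timeouts, gaps in seconds:", gaps(found))
```

Run against a full serial0.txt, a roughly constant gap would hint at a periodic loss of connectivity rather than a one-off glitch, though that interpretation is an assumption to be checked against more than one job.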

## Suggestions
- Test case improvements
  - support_server/setup
  - firewall services add zone=EXT service=service:target
  - MTU check for packet size, covered in #135200
  - MTU size configuration
    - By default MTU runs at MTU 1500, however for openQA TORs we have MTU 9216 configured for each port and the future network automation service will apply this setting as well by default throughout PRG2, lowering the MTU will then need to be requested via SD-Ticket. https://sd.suse.com/servicedesk/customer/portal/1/SD-130143
- File SD-INFRA ticket for network issue
- Confirm how #111908 is related
- Come up with better reproducer, e.g. run an openQA test scenario as single-machine test with support_server still on a tap-worker

## Work-arounds
- Adjust job groups to pin specific workers by hard-coding relevant worker classes
  - TODO: Identify what has been implemented and will need to be undone
- Adjust salt to remove workers not known to work
  - https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596/diffs
  - No consensus and hence not merged yet
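
For the MTU suggestion, a quick manual check is to send non-fragmentable ICMP packets sized exactly to the MTU under test. The sketch below, in Python, only computes the payload size and assembles a command line for the Linux iputils `ping` (`-M do` prohibits fragmentation, `-s` sets the payload size). The function names and the host name `support-server.example` are hypothetical, and IPv4 without IP options is assumed.

```python
IPV4_HEADER_BYTES = 20  # assumes IPv4 without options
ICMP_HEADER_BYTES = 8


def icmp_payload_for_mtu(mtu):
    """Return the largest ICMP echo payload that fits in one packet of the given MTU."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for an ICMP payload")
    return mtu - overhead


def ping_command(host, mtu, count=3):
    """Build an iputils ping command that fails if the path MTU is below `mtu`."""
    return [
        "ping",
        "-c", str(count),
        "-M", "do",  # set the don't-fragment bit
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


if __name__ == "__main__":
    # 1500 is the default MTU and 9216 the TOR port MTU mentioned in this ticket.
    for mtu in (1500, 9216):
        print(" ".join(ping_command("support-server.example", mtu)))
```

If the command for MTU 1500 succeeds between two workers but larger sizes fail, that would be consistent with a path MTU mismatch, although other causes of packet loss are not ruled out by this check.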
