action #138698

Updated by livdywan about 1 year ago

## Observation 

 openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_priorityfencing_supportserver@64bit fails in 
 [setup](https://openqa.suse.de/tests/12691358/modules/setup/steps/67) 

 ## Test suite description 
 The base test suite is used for job templates defined in YAML documents. It has no settings of its own. 


 ## Reproducible 

 Not easily reproducible. Failure is sporadic. See Next & Previous Results tab in linked test. 

 Failed on (at least) Build [:29290:libfido2](https://openqa.suse.de/tests/12691358) (current job) 


 ## Expected result 

 Last good: [:29978:qemu](https://openqa.suse.de/tests/12686884) (or more recent) 

 ## Acceptance criteria 
 * **AC1:** [qam_ha_priorityfencing_supportserver](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_priorityfencing_supportserver&version=15-SP5) scenario passes reliably 
 * **AC2:** Unrelated issues are identified and tracked as individual issues 

 ## Problem 
 * **H1** The product has changed, unclear as of #138698-5 
 * **H2** Fails because of changes in test setup 
  * **H2.1** worker3[3-6] are problematic 
     * **E2.1-1** Disable worker3[3-6] and test if https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&viewPanel=24 improves again towards a lower fail ratio, see #138698-3 
     * **E2.1-2** Disable worker3[3-4] https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/666 -> has led to a good passing rate 
    * **E2.1-3** Disable worker33 only https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/667 *Being conducted* 
 * **H3** *REJECTED* Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA -> **O3-1** #138698-6 -> reject 
 * **H4** *REJECTED* Fails because of changes in test management configuration, e.g. openQA database settings -> **O4-1** #138698-5 -> reject 
 * **H5** *REJECTED* Fails because of changes in the test software itself (the test plan in source code as well as needles) -> **O5-1** #138698-5 -> reject 
 * **H6** *REJECTED* Sporadic issue, i.e. the root problem has been hidden in the system for a long time but does not show symptoms every time -> **O6-1** https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1698050125250&to=1698433324349&viewPanel=24 shows a significant increase in the ratio of failed+parallel_failed from 4+7=11% on 2023-10-25 to 15+32=47% on 2023-10-27 -> reject, but the test code is racy as of #138698-5 
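
The arithmetic behind **O6-1** can be reproduced with a minimal sketch. The percentages are the ones quoted above; the function and variable names are illustrative assumptions and not part of openQA or the Grafana dashboard.

```python
def combined_fail_ratio(failed_pct: float, parallel_failed_pct: float) -> float:
    """Share of jobs ending as failed or parallel_failed, in percent."""
    return failed_pct + parallel_failed_pct


# Values quoted in O6-1 (read off the monitoring panel).
before = combined_fail_ratio(4, 7)    # 2023-10-25
after = combined_fail_ratio(15, 32)   # 2023-10-27

increase_points = after - before      # absolute change in percentage points
increase_factor = after / before      # relative change

print(f"{before:.0f}% -> {after:.0f}% "
      f"(+{increase_points:.0f} points, x{increase_factor:.1f})")
# prints "11% -> 47% (+36 points, x4.3)"
```

A jump of this size within two days is what argues against a long-hidden sporadic issue and for a recent change.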

 ## Suggestions 
 * Consider temporarily disabling GRE tunnel use again completely, i.e. only run multi-machine tests again on a single host - see #135035 
 * Temporarily disable the tap class on progressively more workers to narrow down or identify the culprit 
 * Host-to-host communication does not seem to be stable. 
   * Run openQA tests as well as more low-level tests 
   * The iscsi-server+client test scenario 
   * http://open.qa/docs/#_debugging_open_vswitch_configuration 
 * With all mitigations, monitor the impact on the job queue to prevent overload and overly long job queues, e.g. see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now 
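
As one possible form of the "more low-level tests" suggested above, the following is a rough sketch of a host-to-host reachability check. It assumes Linux `ping` (iputils) is available on the host where it runs; the function names are hypothetical, and the host list has to be supplied by the caller. It only checks ICMP reachability from the local host and does not by itself exercise the GRE tunnels or tap devices.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_command(host: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    """Build a ping invocation: `count` echo requests, `timeout_s` seconds wait per reply."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def run_quietly(cmd: List[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def unreachable_hosts(
    hosts: Iterable[str],
    runner: Callable[[List[str]], int] = run_quietly,
) -> List[str]:
    """Return the hosts for which ping exits non-zero (no reply or an error)."""
    return [host for host in hosts if runner(ping_command(host)) != 0]
```

Calling `unreachable_hosts([...])` with the worker hostnames in question, repeatedly and from several workers, could help show whether a specific host pair is unstable. The `runner` parameter exists only so the logic can be exercised without network access.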


 ## Further details 

 Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_priorityfencing_supportserver&version=15-SP5) 

 ## Rollback steps 
 * Put the tap class back on worker33 and worker34: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/666
