action #138698
Updated by livdywan about 1 year ago
## Observation

openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_priorityfencing_supportserver@64bit fails in
[setup](https://openqa.suse.de/tests/12691358/modules/setup/steps/67)

## Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

## Reproducible

Not easily reproducible. Failure is sporadic. See Next & Previous Results tab in linked test.

Failed on (at least) Build [:29290:libfido2](https://openqa.suse.de/tests/12691358) (current job)

## Expected result

Last good: [:29978:qemu](https://openqa.suse.de/tests/12686884) (or more recent)

## Acceptance criteria

* **AC1:** [qam_ha_priorityfencing_supportserver](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_priorityfencing_supportserver&version=15-SP5) scenario passes reliably
* **AC2:** Unrelated issues are identified and tracked as individual issues

## Problem

* **H1** The product has changed, unclear as of #138698-5
* **H2** Fails because of changes in test setup
  * **H2.1** worker3[3-6] are problematic -> **E2.1-1** Disable worker3[3-6] and test if https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&viewPanel=24 improves again towards lower fail-ratio, see #138698-3
* **H3** *REJECTED* Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA -> **O3-1** #138698-6 -> reject
* **H4** *REJECTED* Fails because of changes in test management configuration, e.g. openQA database settings -> **O4-1** #138698-5 -> reject
* **H5** *REJECTED* Fails because of changes in the test software itself (the test plan in source code as well as needles) -> **O5-1** #138698-5 -> reject
* **H6** *REJECTED* Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time -> **O6-1** https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1698050125250&to=1698433324349&viewPanel=24 shows significant increase in ratio of failed+parallel_failed 2023-10-25 4+7=11% to 2023-10-27 15+32=47% -> reject but test code is racy as of #138698-5

## Suggestions

* Consider temporarily disabling GRE tunnel use again completely, i.e. only run multi-machine tests again on a single host - see #135035
* Temporarily disable tap class from more and more workers trying to narrow down or identify the culprit
* host to host communication seems to be not stable.
* Run openQA tests as well as more low-level tests
  * the iscsi-server+client test scenario
  * http://open.qa/docs/#_debugging_open_vswitch_configuration
* With all mitigations monitor the impact on job queue to prevent overload and too long job queues, e.g. see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now

## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_priorityfencing_supportserver&version=15-SP5)

## Rollback steps

* Put back tap on 33/34 https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/666
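
Note on the O6-1 figures: the fail ratio quoted there is the sum of the failed and parallel_failed percentages read off the monitoring panel. A minimal sketch of that arithmetic follows, assuming the four percentages from the ticket; the helper name `combined_fail_ratio` is hypothetical and not part of openQA or Grafana.

```python
def combined_fail_ratio(failed_pct: int, parallel_failed_pct: int) -> int:
    """Sum the failed and parallel_failed percentages into one fail ratio.

    Hypothetical helper for illustration only; inputs are the values
    quoted in O6-1, not data fetched from the monitoring dashboard.
    """
    return failed_pct + parallel_failed_pct


# Values as quoted in O6-1
before = combined_fail_ratio(4, 7)    # 2023-10-25
after = combined_fail_ratio(15, 32)   # 2023-10-27

print(f"2023-10-25: {before}%")
print(f"2023-10-27: {after}%")
print(f"increase: {after - before} percentage points")
```

The same sum can be used when checking whether disabling worker3[3-6] (E2.1-1) brings the ratio back towards the 2023-10-25 level.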