action #134282
Updated by okurz about 1 year ago
## Observations
- Multi-machine jobs can't download artifacts from OBS/pip
## Theory
(Our current understanding of how the world works, based on the observations above; still to be filled in)
## Problem
* **H1** *REJECT* The product has changed
    * -> **E1-1** Compare tests on multiple product versions -> **O1-1-1** We observed the problem in multiple products with different states of maintenance updates, and the support server is an old SLE12SP3 that has not received maintenance updates for months. It is unlikely that the iSCSI client changed recently, but that still has to be verified
    * -> **E1-2** Find output of "openqa-investigate" jobs comparing against "last good" -> **O1-2-1** https://openqa.suse.de/tests/12080239#comment-993398 reproducibly shows four failed tests, i.e. the failure shows up regardless of the state of test code and product, so *H1* is rejected
* **H2** Fails because of changes in test setup
    * **H2.1** Our test hardware equipment behaves differently
    * **H2.2** The network behaves differently
* **H3** Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
    * -> **E3-1** TODO compare package versions installed on the machines for "last good" vs. "first bad", e.g. from /var/log/zypp/history (see the zypp history sketch after this list)
* -> **E3-2** It is probably *not* the Open vSwitch version, see comment #134282#note-98
* **H4** Fails because of changes in test management configuration, e.g. openQA database settings
* -> wait for E5-1
* **H5** Fails because of changes in the test software itself (the test plan in source code as well as needles)
    * -> **E5-1** TODO Compare vars.json from "last good" with "first bad" and in particular look into changes to needles and job templates (see the vars.json sketch after this list)
* **H6** *REJECT* Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
    * -> **O6-1** #134282#note-71, but the fail ratio is not 100%
    * -> **E6-2** Increase the timeout in the initial step of the firewall configuration to check whether we get unreliable test results due to timeouts (see the TIMEOUT_SCALE sketch after this list)
        * -> TODO Investigate the timeout in the initial step of the firewall configuration
        * -> TODO Add TIMEOUT_SCALE=3 on the support servers of non-HanaSR cluster tests
* **H7** Multi-machine jobs no longer work across workers since 2023-08 -> also see #111908 and #135773
    * **H7.1** *REJECT* Multi-machine jobs generally work fine when executed on a single physical machine -> **E7.1-1** Run multi-machine jobs only on a single physical machine -> **O7.1-1-1** See #134282-80
* We *could* pin jobs to a worker but that will need to be implemented properly, see #135035
* We otherwise need to understand the infra setup better
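A minimal sketch for **E3-1**, run directly on a worker host that executed both the "last good" and the "first bad" job (the date range below is a placeholder for those two dates): list the packages zypper installed or removed in between.

```bash
# List install/remove entries from zypper's history between two placeholder dates.
# In /var/log/zypp/history field 1 is the timestamp, field 2 the action,
# fields 3/4 the package name and version.
awk -F'|' '$1 >= "2023-08-01" && $1 <= "2023-08-15" &&
  ($2 == "install" || $2 == "remove") {print $1, $2, $3, $4}' /var/log/zypp/history
```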
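A rough sketch for **E5-1**, assuming the usual openQA file route `/tests/<id>/file/vars.json` is reachable for both jobs and `jq` is available; the job ids below are placeholders for the actual "last good" and "first bad" jobs.

```bash
# Diff the normalized job settings of "last good" vs "first bad".
good=11900000  # placeholder: last good job id
bad=12080239   # placeholder: first bad job id
diff <(curl -s "https://openqa.suse.de/tests/$good/file/vars.json" | jq -S .) \
     <(curl -s "https://openqa.suse.de/tests/$bad/file/vars.json"  | jq -S .)
```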
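A rough sketch for **E6-2**, using `openqa-clone-job` to re-trigger a failing scenario with scaled-up timeouts and check whether the failures are timing related; the job id and the BUILD value are placeholders.

```bash
# Clone the failing job with all timeouts tripled (TIMEOUT_SCALE is an
# os-autoinst setting), outside of any job group so it does not show up
# on the dashboards.
openqa-clone-job --within-instance https://openqa.suse.de 12080239 \
  TIMEOUT_SCALE=3 _GROUP=0 BUILD=poo134282_timeout_scale
```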
## Suggestions
- Test case improvements
- support_server/setup
    - firewall services add zone=EXT service=service:target (see the firewall sketch after this list)
  - MTU check for packet size - covered in #135200 (see the ping sketch after this list)
- MTU size configuration
      - By default hosts run with MTU 1500; for the openQA ToR switches, however, MTU 9216 is configured on each port, and the future network automation service will apply this setting by default throughout PRG2 as well. Lowering the MTU would then have to be requested via SD ticket, see https://sd.suse.com/servicedesk/customer/portal/1/SD-130143
- Come up with a better reproducer, e.g. run an openQA test scenario as a single-machine test with the support_server still on a tap worker -> see #134282-104
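A rough sketch for the support_server firewall suggestion above, for a firewalld-based host; the service name is only an example, and the `zone=EXT` syntax quoted in the list belongs to the older yast firewall tooling.

```bash
# Permanently open the needed service in the external zone and reload firewalld.
firewall-cmd --permanent --zone=external --add-service=nfs
firewall-cmd --reload
```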
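A rough sketch for the MTU check suggestion, sending don't-fragment pings between two tap workers (the hostname is a placeholder): a 1472-byte payload corresponds to MTU 1500, an 8972-byte payload to MTU 9000 jumbo frames.

```bash
# -M do forbids fragmentation, so the ping fails if any hop's MTU is too small.
ping -M do -c 3 -s 1472 other-worker.oqa.suse.de   # 1472 + 28 bytes of headers = 1500
ping -M do -c 3 -s 8972 other-worker.oqa.suse.de   # 8972 + 28 = 9000; fails unless jumbo frames work end to end
```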
## Rollback steps
- Re-enable [OSD deployments](https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules)
## Out of scope
* Improving openQA upstream documentation -> #135914
* ovs-server+client scenario *and* MTU related fixes -> #135773
* lessons learned -> #136007
* SAP NFS server related issues qesap-nfs.qa.suse.cz -> #135938
* Problems to reach machines in external network in multi-machine tests -> #135056