action #132827
Updated by rfan1 over 1 year ago
## Observation
Some tests are failing due to DNS resolution issues on the "sapworker*" workers, especially multi-machine tests. Can someone help check?
Example failures:
https://openqa.suse.de/tests/11593878#step/salt_master/15
http://openqa.suse.de/tests/11594635#step/rsync_client/12
## Reproducible
[Failed test links](https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&arch=&flavor=&machine=&test=&modules=salt_master%2Crsync_client&module_re=&distri=sle&build=20230716-1&groupid=414#)
## Expected result
I ran the rsync tests on another worker without any issue: http://openqa.suse.de/tests/11594925#dependencies
## Rollback steps
Add back production worker class on sapworker{1,2,3}, i.e. revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564
## Further details
There may be network problems on the "sapworker*" workers. In my tests (at least for the rsync test), the same test passes on "worker5" but fails on "sapworker1".
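
To narrow this down, name resolution could be compared directly on a passing and a failing worker. Below is a minimal Python sketch, not part of openQA or os-autoinst. The function name `check_resolution` is hypothetical, and the hostnames to pass in are an assumption: use whatever names the failing test modules try to reach.

```python
import socket


def check_resolution(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to a sorted list of resolved addresses, or None on failure.

    `resolver` defaults to the system resolver via socket.getaddrinfo; it is
    injectable only so the function can be exercised without network access.
    """
    results = {}
    for name in hostnames:
        try:
            infos = resolver(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string for both IPv4 and IPv6.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Resolution failed for this name (e.g. NXDOMAIN or no DNS server reachable).
            results[name] = None
    return results
```

Running `check_resolution([...])` with the same hostnames on "worker5" and on "sapworker1" and comparing the output would show whether only the sapworker hosts return `None`. That would point at the worker's resolver or network setup rather than at the tests.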
## Suggestions
- First ensure that all openQA workers have the salt state applied cleanly, e.g. `sudo salt --no-color -C 'G@roles:worker' state.apply`
- Maybe the failure handling can be improved on the os-autoinst side, e.g. with a better "die" message/reason
- As a temporary measure, consider disabling the "tap" class on affected workers, e.g. make it tap_pooXXX
- Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup
- Ensure that our salt states set up everything needed to run stable multi-machine tests
- Add back production worker classes for all affected machines openqaworker1, worker5, sapworker{1-7}