action #132827
Updated by okurz 6 months ago
## Observation
Some tests are failing due to DNS resolution issues on the "sapworker*" workers, especially multi-machine tests. Can someone help check?
Some error messages are shown below:
https://openqa.suse.de/tests/11593878#step/salt_master/15
http://openqa.suse.de/tests/11594635#step/rsync_client/12
## Reproducible
[Failed test links](https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&arch=&flavor=&machine=&test=&modules=salt_master%2Crsync_client&module_re=&distri=sle&build=20230716-1&groupid=414#)
## Expected result
I ran the rsync tests on another worker without any issue: http://openqa.suse.de/tests/11594925#dependencies
## Rollback steps
* Add back production worker class on all OSD machines mentioning #132827 sapworker{1,2,3}, i.e. revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564
* Add back "tap" worker class to openqaworker1 and sapworker{1,2,3}
## Further details
There may be network problems on the "sapworker*" workers. Based on my tests (at least the rsync test result), the same test passes on "worker5" but fails on "sapworker1".
## Suggestions
- First ensure that all openQA workers have the salt state applied cleanly, e.g. `sudo salt --no-color -C 'G@roles:worker' state.apply`
- Maybe the failure can be improved on the os-autoinst side, e.g. with a better "die" message/reason
- ~~As temporary measure consider disabling the "tap" class from affected workers, e.g. make it tap_pooXXX~~
- ~~Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup~~
- ~~Ensure that our salt states ensure all what is needed to run stable multi-machine tests~~
- Add back production worker classes for all affected machines ~~openqaworker1, sapworker{1-7}~~, e.g. qesapworker-prg1-5
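
To compare name resolution between a passing and a failing worker, something like the following could be run on each host. This is only a minimal sketch: `check_dns` and its `resolve` parameter are hypothetical names introduced here, not part of openQA or os-autoinst, and the list of hostnames to check is an assumption that has to be filled in from the failing test modules.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_dns(
    hostnames: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to its resolved IPv4 address, or None on failure.

    `resolve` is injectable so the function can be exercised without
    network access; by default it uses the system resolver.
    """
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError
            results[name] = None
    return results
```

Calling `check_dns(["openqa.suse.de"])` on, for example, "worker5" and "sapworker1" and comparing the returned dictionaries would show whether the failing worker cannot resolve names that the passing one can.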