action #132827
Updated by livdywan about 1 year ago
## Observation

I can see that some tests are failing due to DNS resolution issues on workers "sapworker*", especially multi-machine tests. Can someone help check? Some error messages can be seen here:

- https://openqa.suse.de/tests/11593878#step/salt_master/15
- http://openqa.suse.de/tests/11594635#step/rsync_client/12

## Reproducible

[Failed test links](https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&arch=&flavor=&machine=&test=&modules=salt_master%2Crsync_client&module_re=&distri=sle&build=20230716-1&groupid=414#)

## Expected result

I tried running the rsync tests on another worker without any issue: http://openqa.suse.de/tests/11594925#dependencies

## Rollback steps

Add back the production worker class on sapworker{1,2,3}, i.e. revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564

## Further details

There may be network problems with workers "sapworker*". Based on my tests (at least for the rsync test results), the same test passes on "worker5" but fails on "sapworker1".

## Suggestions

- First ensure that all openQA workers have the salt state applied cleanly, e.g. `sudo salt --no-color -C 'G@roles:worker' state.apply`
- Maybe the failure can be improved on the os-autoinst side, e.g. with a better "die" message/reason
- As a temporary measure, consider disabling the "tap" class on affected workers, e.g. make it tap_pooXXX
- Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup
- Ensure that our salt states provide everything that is needed to run stable multi-machine tests
- Add back production worker classes for all affected machines: openqaworker1, worker5, sapworker{1-7}
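
To narrow down whether name resolution itself fails on a given worker, a small check like the following could be run on both a passing worker (e.g. "worker5") and a failing one (e.g. "sapworker1") and the results compared. This is only a minimal sketch and not part of the ticket: the function name `check_dns` is hypothetical, and which hostnames to pass in is an assumption that depends on what the failing test modules actually try to resolve.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_dns(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to its resolved IPv4 address, or None on failure.

    `check_dns` is a hypothetical helper for illustration; `resolver` is
    injectable so the logic can be exercised without network access.
    """
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            results[name] = None
    return results


def failed_hosts(results: Dict[str, Optional[str]]) -> list:
    """Return the hostnames that could not be resolved, in sorted order."""
    return sorted(name for name, addr in results.items() if addr is None)
```

Calling `failed_hosts(check_dns(["openqa.suse.de"]))` on each worker would return an empty list where resolution works and the list of unresolved names where it does not. The hostname here is only an example taken from the links in this ticket.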