action #132827

Updated by rfan1 10 months ago

## Observation 
 I can see that some tests are failing due to DNS resolution issues on the "sapworker*" workers, especially multi-machine tests. Can someone help check? 

 Example failures: 
 https://openqa.suse.de/tests/11593878#step/salt_master/15 
 http://openqa.suse.de/tests/11594635#step/rsync_client/12 

 ## Reproducible 

 [Failed test links](https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&arch=&flavor=&machine=&test=&modules=salt_master%2Crsync_client&module_re=&distri=sle&build=20230716-1&groupid=414#) 


 ## Expected result 

 I ran the rsync tests on another worker without any issue: http://openqa.suse.de/tests/11594925#dependencies 

 ## Rollback steps 

 Add back production worker class on sapworker{1,2,3}, i.e. revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/564 

 ## Further details 

 There may be network problems on the "sapworker*" workers. Based on my tests (at least for the rsync test), the same test passes on "worker5" but fails on "sapworker1". 


 ## Suggestions 
 - First ensure that all openQA workers have the salt state applied cleanly, e.g. `sudo salt --no-color -C 'G@roles:worker' state.apply` 
 - Maybe the failure handling can be improved on the os-autoinst side, e.g. with a better "die" message/reason 
 - As a temporary measure, consider disabling the "tap" class on affected workers, e.g. rename it to tap_pooXXX 
 - Debug multi-machine capabilities according to http://open.qa/docs/#_verify_the_setup 
 - Ensure that our salt states provide everything needed to run stable multi-machine tests 
 - Add back production worker classes for all affected machines openqaworker1, worker5, sapworker{1-7}
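
To compare name resolution between a passing worker (e.g. "worker5") and a failing one (e.g. "sapworker1"), a minimal sketch like the following could be run on each host. It is an assumption that checking resolution on the worker host itself is informative here, since the failing lookups may happen inside the test VMs on the tap network. The function name `check_resolution` and its `resolver` parameter are hypothetical, introduced only for illustration.

```python
import socket


def check_resolution(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to its sorted resolved addresses, or None on failure.

    `resolver` defaults to socket.getaddrinfo; it is injectable (a
    hypothetical convenience) so the logic can be exercised without network.
    """
    results = {}
    for name in hostnames:
        try:
            infos = resolver(name, None)
            # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string for both IPv4 and IPv6.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Resolution failed for this name; record it and keep going.
            results[name] = None
    return results


def format_report(results):
    """Render one line per hostname, marking failed lookups."""
    lines = []
    for name, addrs in results.items():
        status = ", ".join(addrs) if addrs else "FAILED to resolve"
        lines.append(f"{name}: {status}")
    return "\n".join(lines)
```

For example, `print(format_report(check_resolution(["openqa.suse.de"])))` could be run on both hosts and the outputs compared. A difference would point to host-level DNS configuration, while identical results would suggest looking at the multi-machine network setup instead.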
