action #159303

Updated by okurz 6 months ago

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604 

 ``` 
 ERROR: Minions returned with non-zero exit code 
 worker32.oqa.prg2.suse.org: 
 openqaworker18.qa.suse.cz: 
 worker31.oqa.prg2.suse.org: 
 worker35.oqa.prg2.suse.org: 
 worker33.oqa.prg2.suse.org: 
 worker29.oqa.prg2.suse.org: 
 worker30.oqa.prg2.suse.org: 
 worker40.oqa.prg2.suse.org: 
 worker34.oqa.prg2.suse.org: 
 openqaworker16.qa.suse.cz: 
 openqaworker17.qa.suse.cz: 
 worker-arm2.oqa.prg2.suse.org: 
 worker-arm1.oqa.prg2.suse.org: 
 qesapworker-prg4.qa.suse.cz: 
 qesapworker-prg6.qa.suse.cz: 
 qesapworker-prg5.qa.suse.cz: 
 qesapworker-prg7.qa.suse.cz: 
 sapworker2.qe.nue2.suse.org: 
 sapworker3.qe.nue2.suse.org: 
 sapworker1.qe.nue2.suse.org: 
 openqaworker14.qa.suse.cz: 
 mania.qe.nue2.suse.org: 
 petrol.qe.nue2.suse.org: 
 openqaworker1.qe.nue2.suse.org: 
 imagetester.qe.nue2.suse.org: 
 diesel.qe.nue2.suse.org: 
 grenache-1.oqa.prg2.suse.org: 
 openqaworker-arm-1.qe.nue2.suse.org: 
     Minion did not return. [Not connected] 
 ``` 

 ## Suggestions 
 * **DONE** Retry the pipeline in case the failure is temporary. Result: the failure has persisted for at least a couple of hours so far 
 * Ensure that after the machine is up again in #159270 the deployment is retried and works fine 
 * Wait for #157753 to have complete automation 
 * This is related to #157753. openqaworker-arm-1 is expected to be unresponsive at times, but automatic recovery should already handle that. Check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective, and how we can keep osd-deployment from getting stuck on this one host, which is expected to break from time to time. Assume that our automatic recovery is good enough and always recovers the machine soon enough 
 * Teach livdywan that only one machine failed (the last entry, openqaworker-arm-1), not all the listed ones 
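
 To illustrate the last point, here is a minimal sketch of how the job output above can be read. It assumes that each minion name sits on its own line ending with a colon, and that any message belonging to a minion follows on the lines after it, so a minion with nothing after its name returned cleanly. The function name `failed_minions` and the line-matching heuristic are hypothetical. They are not part of salt or of the osd-deployment pipeline.

```python
import re

# Heuristic (assumption): a minion line is a single hostname-like token
# followed by a colon, e.g. "worker32.oqa.prg2.suse.org:".
MINION_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9.-]*):$")


def failed_minions(log):
    """Return {minion: message} for minions that have a message after their name."""
    messages = {}
    current = None
    for raw in log.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # skip blank lines
        match = MINION_LINE.match(line)
        if match:
            current = match.group(1)  # start of a new minion section
        elif current is not None:
            # Any other line after a minion name is that minion's message.
            messages.setdefault(current, []).append(line.strip())
    return {minion: " ".join(lines) for minion, lines in messages.items()}


if __name__ == "__main__":
    sample = """\
 ERROR: Minions returned with non-zero exit code
 worker32.oqa.prg2.suse.org:
 grenache-1.oqa.prg2.suse.org:
 openqaworker-arm-1.qe.nue2.suse.org:
     Minion did not return. [Not connected]
"""
    for minion, message in failed_minions(sample).items():
        print(f"{minion}: {message}")
```

 Applied to the full output above, this would report only openqaworker-arm-1.qe.nue2.suse.org, because it is the only minion with a message after its name.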

 ## Out of scope 
 * Recovering openqaworker-arm-1: #159270
