action #78390
Updated by mkittler over 4 years ago
## Observation

On 2020-11-20 all worker instances on grenache-1 showed up as "broken". Checking on grenache-1 with `systemctl status openqa-worker@39` shows:

```
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
```

so it *seems* like baremetal-support.qa.suse.de is not reachable. openqa.suse.de should have been used in the meantime, but the worker is stuck in a retry loop, maybe for way too long or forever. For more observations see the comments.

## Steps to reproduce

~~Likely can be reproduced by configuring a worker to connect to two web UI hosts where one cannot be reached at all (e.g. valid DNS entry but host not up)~~

## Acceptance criteria

* **AC1:** ~~The configured and reachable webUI hosts are reached while the down host is ignored as long as it is down~~
* **AC2:** ~~worker does not show up as "broken" on a reachable webUI~~
* **AC3:** ~~worker still retries for multiple minutes when a webUI is temporarily down, e.g. during reboot~~

The fact that there are multiple web UIs involved is not really the issue here. See further comments.

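
The `Name or service not known` text in the log is the resolver's wording for a failed name lookup, which is a different case from the "valid DNS entry but host not up" scenario in the steps to reproduce. A minimal Python sketch for telling the two apart on a worker host follows. `can_resolve` and its `resolver` parameter are hypothetical names for illustration and are not part of openQA:

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False on a name-resolution error.

    `resolver` defaults to the system resolver and can be swapped out,
    for example to exercise the function without network access.
    """
    try:
        resolver(hostname, None)
    except socket.gaierror:
        # Raised for lookup failures such as "Name or service not known".
        return False
    return True
```

If `can_resolve("baremetal-support.qa.suse.de")` returns `False` on grenache-1, the registration failure is a lookup problem rather than a web UI that is up in DNS but not answering.
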
## Suggestions

Maybe we introduced this regression when we extended the waiting periods while a webUI is down.

* DONE (at least regarding the initial suspicion): check history of our changes
* DONE (at least regarding the initial suspicion): crosscheck and extend tests
* fix behaviour
* confirm working in OSD infrastructure
* DONE: remove workarounds, e.g. ensure baremetal-support.qa.suse.de is added back to salt-pillars-openqa for grenache-1

## Workaround

Remove webUI hosts that are down from /etc/openqa/workers.ini and restart worker services
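
The workaround can be sketched in Python as a small text filter. This assumes that workers.ini lists the web UI hosts as a space-separated value on a `HOST = ...` line, which should be checked against the actual file on the worker. `drop_hosts` is a hypothetical helper, not something shipped with openQA:

```python
def drop_hosts(ini_text, down_hosts):
    """Return `ini_text` with `down_hosts` removed from every `HOST = ...` line.

    Assumption: hosts are space-separated on a line whose key is exactly HOST.
    All other lines, including comments, are passed through unchanged.
    """
    down = set(down_hosts)
    out = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "HOST":
            kept = [host for host in value.split() if host not in down]
            line = "HOST = " + " ".join(kept)
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative content only; the real /etc/openqa/workers.ini may differ.
sample = "[global]\nHOST = openqa.suse.de baremetal-support.qa.suse.de\n"
cleaned = drop_hosts(sample, ["baremetal-support.qa.suse.de"])
```

Here `cleaned` keeps only openqa.suse.de on the `HOST` line. After writing the result back, the worker services still need a restart, e.g. `systemctl restart openqa-worker@39` for the instance inspected above.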