action #78390

Updated by mkittler over 3 years ago

## Observation 

 On 2020-11-20 all worker instances on grenache-1 show up as "broken". Checking on grenache-1 with `systemctl status openqa-worker@39` shows: 

 ``` 
 Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de 
 Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds 
 Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de 
 Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds 
 ``` 

 So it *seems* like baremetal-support.qa.suse.de is not reachable. openqa.suse.de should have been used instead, but the worker is stuck in a retry loop, maybe for way too long or forever. For more observations see the different comments. 
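
To get a quick overview of how often a worker fails to register and why, the journal output can be summarized. The following is a minimal sketch that assumes the warning format is exactly the one shown in the excerpt above; the function name `count_failed_registrations` and the regular expression are illustrative and not part of openQA.

```python
import re
from collections import Counter

# Matches the "[warn] ... Failed to register at <host> - connection error: ..."
# lines as they appear in the journal excerpt above (assumed format).
FAILED = re.compile(
    r"\[warn\] \[pid:(?P<pid>\d+)\] Failed to register at (?P<host>\S+) - "
    r"connection error: (?P<error>.+?) - trying again in (?P<delay>\d+) seconds"
)


def count_failed_registrations(journal_text):
    """Count failed registration attempts per (host, error) pair."""
    failures = Counter()
    for line in journal_text.splitlines():
        match = FAILED.search(line)
        if match:
            failures[(match["host"], match["error"])] += 1
    return failures


SAMPLE = """\
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
"""

for (host, error), count in count_failed_registrations(SAMPLE).items():
    print(f"{host}: {count} failed attempts ({error})")
```

A steadily growing count for a single host with the same error would be consistent with the worker being stuck in the retry loop described above.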

 ## Steps to reproduce 

 ~~Likely can be reproduced by configuring a worker to connect to two web UI hosts where one cannot be reached at all (e.g. valid DNS entry but host not up)~~ 

 ## Acceptance criteria 
 * **AC1:** ~~The configured and reachable webUI hosts are reached while the down host is ignored as long as it is down~~ 
 * **AC2:** ~~webUI does not show up as "broken" on a reachable webUI~~ 
 * **AC3:** ~~worker still retries for multiple minutes when a webUI is temporarily down, e.g. during reboot~~ 

 The fact that there are multiple web UIs involved is not really the issue here. See further comments. 

 ## Suggestions 
 Maybe when we extended the waiting periods while a webUI is down we introduced this regression 
 * DONE (at least regarding the initial suspicion): check history of our changes 
 * DONE (at least regarding the initial suspicion): crosscheck and extend tests 
 * fix behaviour 
 * confirm working in OSD infrastructure 
 * DONE: remove workarounds, e.g. ensure baremetal-support.qa.suse.de is added back to salt-pillars-openqa for grenache-1 

 ## Workaround 
 Remove webUI hosts that are down from /etc/openqa/workers.ini and restart worker services
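
The first half of the workaround could be scripted roughly as follows. This is only a sketch under assumptions not stated in this ticket: it assumes that workers.ini lists the web UI hosts as a space-separated `HOST` option in a `[global]` section, and the helper name `drop_hosts` is hypothetical. Note that Python's `configparser` does not preserve comments, so reviewing the result before writing it back is advisable.

```python
import configparser
import io


def drop_hosts(ini_text, down_hosts):
    """Return ini_text with the given hosts removed from [global] HOST.

    Assumes a space-separated HOST option in a [global] section
    (an assumption about the workers.ini layout, not taken from this ticket).
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as HOST case-sensitive
    parser.read_string(ini_text)
    if not parser.has_section("global"):
        return ini_text  # nothing to change
    hosts = parser.get("global", "HOST", fallback="").split()
    kept = [host for host in hosts if host not in down_hosts]
    parser.set("global", "HOST", " ".join(kept))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


example = """\
[global]
HOST = openqa.suse.de baremetal-support.qa.suse.de
"""

print(drop_hosts(example, {"baremetal-support.qa.suse.de"}))
```

After writing the result back to /etc/openqa/workers.ini, the affected worker services still need to be restarted, for instance with `systemctl restart openqa-worker@39` for the instance inspected above.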