action #78390

Updated by mkittler 9 months ago

## Observation

On 2020-11-20 all worker instances on grenache-1 showed up as "broken". Checking on grenache-1 with `systemctl status openqa-worker@39` shows:

```
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
```

So it *seems* like baremetal-support.qa.suse.de is not reachable. openqa.suse.de should have been used instead, but the worker is stuck in a retry loop, maybe for way too long or forever. For more observations see the different comments.
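The "Name or service not known" text in the log is the standard resolver message for a host name that cannot be resolved, so a quick check of which configured hosts fail name resolution can help narrow things down. The following is a minimal sketch, not part of openQA; the function name `unresolvable_hosts` and the injectable `resolve` parameter are assumptions made for illustration.

```python
import socket


def unresolvable_hosts(hosts, resolve=socket.getaddrinfo):
    """Return the subset of hosts whose names cannot be resolved.

    `resolve` defaults to socket.getaddrinfo and is injectable so the
    check can be exercised without network access (hypothetical helper).
    """
    failed = []
    for host in hosts:
        try:
            # Port 80 is arbitrary; only the name lookup matters here.
            resolve(host, 80)
        except socket.gaierror:
            # Raised for resolver failures such as "Name or service not known".
            failed.append(host)
    return failed
```

Calling `unresolvable_hosts(["openqa.suse.de", "baremetal-support.qa.suse.de"])` on the worker host would list the names that currently fail to resolve there.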

## Steps to reproduce

~~Likely can be reproduced by configuring a worker to connect to two web UI hosts where one cannot be reached at all (e.g. valid DNS entry but host not up)~~

## Acceptance criteria
* **AC1:** ~~The configured and reachable webUI hosts are reached while the down host is ignored as long as it is down~~
* **AC2:** ~~webUI does not show up as "broken" on a reachable webUI~~
* **AC3:** ~~worker still retries for multiple minutes when a webUI is temporarily down, e.g. during reboot~~

The fact that there are multiple web UIs involved is not really the issue here. See further comments.
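Independent of the actual root cause, the intent of the original criteria (reach the hosts that are up, keep retrying a temporarily unavailable host for some minutes, but not forever) can be sketched as a bounded retry loop. This is an illustrative sketch only, not the openQA worker's implementation (which is written in Perl). The name `register_hosts`, the `register` callback, the use of `ConnectionError`, and the 300-second budget are all assumptions; only the 10-second delay mirrors the log above.

```python
import time


def register_hosts(hosts, register, max_wait=300.0, delay=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Try to register with every host, retrying failures within a time budget.

    `register(host)` is a hypothetical callback that raises ConnectionError
    on failure. Returns (registered, still_failing) once every host has
    succeeded or the budget of `max_wait` seconds is used up.
    """
    pending = list(hosts)
    registered = []
    deadline = clock() + max_wait
    while pending:
        still_pending = []
        for host in pending:
            try:
                register(host)
                registered.append(host)
            except ConnectionError:
                # A failing host does not prevent the others from being tried.
                still_pending.append(host)
        pending = still_pending
        # Stop when done, or when another delay would exceed the budget.
        if not pending or clock() + delay > deadline:
            break
        sleep(delay)
    return registered, pending
```

Each round attempts every pending host once, so a host that is down only costs one failed attempt per round and does not block hosts that are up. A real worker would more likely retry asynchronously per host.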

## Suggestions
Maybe when we extended the waiting periods while a webUI is down we introduced this regression
* DONE (at least regarding the initial suspicion): check history of our changes
* DONE (at least regarding the initial suspicion): crosscheck and extend tests
* fix behavior
* confirm working in OSD infrastructure
* DONE: remove workarounds, e.g. ensure baremetal-support.qa.suse.de is added back to salt-pillars-openqa for grenache-1

## Workaround
Remove webUI hosts that are down from /etc/openqa/workers.ini and restart worker services
