action #78390

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs)

Added by okurz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Category: Regressions/Crashes
Start date: 2021-01-18
% Done: 0%

Description

Observation

On 2020-11-20 all worker instances on grenache-1 showed up as "broken". Checking on grenache-1 with systemctl status openqa-worker@39 shows:

Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds

so it seems that the hostname baremetal-support.qa.suse.de can not even be resolved ("Name or service not known"), i.e. the host is effectively unreachable. For further observations see the comments.
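The "Name or service not known" message comes from the resolver, so a quick check on the worker host (a hypothetical diagnostic, not part of the ticket) distinguishes a DNS problem from a host that resolves but is down:

```shell
# Resolve the webUI hostname the worker complains about. A failure here means
# DNS; a successful lookup followed by a failed TCP connect would instead
# point to the host itself being down.
getent hosts baremetal-support.qa.suse.de || echo "DNS lookup failed"
```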

Steps to reproduce

This can likely be reproduced by configuring a worker to connect to two web UI hosts where one cannot be reached at all (e.g. a valid DNS entry but the host not up)
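A minimal sketch of such a configuration, assuming the usual openQA worker config location; the second hostname is just the example from this ticket, adjust both to your setup:

```ini
# /etc/openqa/workers.ini on the worker host: one reachable webUI plus one
# host that is down (or, as in this ticket, whose name does not resolve).
[global]
HOST = https://openqa.suse.de https://baremetal-support.qa.suse.de
```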

Acceptance criteria

  • AC1: The configured and reachable webUI hosts are reached while the down host is ignored as long as it is down
  • AC2: The worker does not show up as "broken" on a reachable webUI
  • AC3: The worker still retries for multiple minutes when a webUI is temporarily down, e.g. during a reboot

The fact that there are multiple web UIs involved is not really the issue here. See further comments.
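The per-host behavior asked for in AC1/AC3 can be illustrated with a small sketch. This is not the actual openQA worker code; names, the attempt cap, and intervals are illustrative only. The point is that each configured webUI gets its own independent retry bookkeeping, so a host that never resolves does not block registration with the reachable ones:

```python
import time

RETRY_INTERVAL = 10  # seconds, as in the log above; not waited for here
MAX_ATTEMPTS = 3     # illustrative cap; the real worker keeps retrying

def register_all(hosts, try_register, attempts=MAX_ATTEMPTS):
    """Register with every host independently.

    try_register(host) returns True on success and False on a connection
    error. A failing host is retried up to `attempts` times without ever
    blocking registration with the other hosts (AC1); only a host that
    keeps failing is eventually given up on, by itself.
    """
    pending = {host: attempts for host in hosts}
    registered = set()
    while pending:
        for host in list(pending):
            if try_register(host):
                registered.add(host)
                del pending[host]
            else:
                pending[host] -= 1
                if pending[host] == 0:
                    del pending[host]  # give up on this host only
        if pending:
            time.sleep(0)  # stand-in for the 10 s back-off between rounds
    return registered

# Simulated environment: one reachable webUI, one with a dead DNS entry.
def fake_register(host):
    return host != "baremetal-support.qa.suse.de"

hosts = ["openqa.suse.de", "baremetal-support.qa.suse.de"]
print(sorted(register_all(hosts, fake_register)))
# → ['openqa.suse.de']
```

The reachable host registers even though the other one never does, which is exactly the behavior the acceptance criteria describe.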

Suggestions

Maybe this regression was introduced when we extended the waiting periods applied while a webUI is down

  • DONE (at least regarding the initial suspicion): check history of our changes
  • DONE (at least regarding the initial suspicion): crosscheck and extend tests
  • fix behavior
  • confirm working in OSD infrastructure
  • DONE: remove workarounds, e.g. ensure baremetal-support.qa.suse.de is added back to salt-pillars-openqa for grenache-1

Workaround

Restart worker services
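As a sketch, the restart as commands (the instance pattern is taken from the log above; run as root on the affected machine):

```shell
# Restart every openqa-worker instance on this host, e.g. on grenache-1.
# The glob must be quoted so systemctl, not the shell, expands it.
systemctl restart 'openqa-worker@*'
# Afterwards verify a single instance, as in the observation above:
systemctl status openqa-worker@39
```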


Related issues: 7 (0 open, 7 closed)

Related to openQA Infrastructure - action #80768: All workers in grenache-1 are broken at 2020-12-07 (Resolved, mkittler, 2020-12-07)

Related to openQA Infrastructure - action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 (Resolved, livdywan)

Related to openQA Infrastructure - action #81046: openqaworker-arm-2.suse.de unreachable (Resolved, livdywan, 2020-12-15)

Related to openQA Infrastructure - action #81210: workers in grenache-1 are broken (Resolved, Xiaojing_liu, 2020-12-21)

Related to openQA Project - action #108091: Most systemd units should not Want= or Require= network.target (bsc#1196359) size:M (Resolved, okurz, 2022-03-09)

Related to openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)

Copied from openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline (Resolved, okurz, 2020-11-19)