action #78218

openQA Project - coordination #80142: [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[openQA][worker] Almost all openQA workers become offline

Added by waynechen55 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-11-19
Due date:
% Done: 0%
Estimated time:

Description

Almost all openQA workers have gone offline, so many openQA jobs have stopped running.

Please refer to https://openqa.suse.de/admin/workers.


Related issues

Related to openQA Infrastructure - coordination #78206: [epic] 2020-11-18 nbg power outage aftermath | Blocked | 2020-11-27

Copied to openQA Project - action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) | Feedback | 2021-01-18

Copied to openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup | In Progress

History

#1 Updated by waynechen55 2 months ago

These are still broken:

grenache-1:10 grenache-1 64bit-ipmi,64bit-ipmi-large-mem,grenache-1 ppc64le Broken 1 20

grenache-1:16 grenache-1 64bit-ipmi,grenache-1 ppc64le Broken 1 20

grenache-1:17 grenache-1 virt-pvusb-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,grenache-1 ppc64le Broken 1 20

grenache-1:39 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20

grenache-1:40 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20

openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20

openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20

openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20

openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20

#2 Updated by okurz 2 months ago

#3 Updated by okurz 2 months ago

  • Status changed from New to Resolved
  • Assignee set to okurz
  • Target version set to Ready

I tried to fix most of them yesterday evening, as described in more detail in #78206. More have been fixed today by others. Any "virtualization jump hosts" and non-qemu based worker host addendums might still be down, but those are out of scope for us as a team.

#4 Updated by okurz 2 months ago

  • Status changed from Resolved to Feedback
  • Assignee changed from okurz to waynechen55
  • Target version changed from Ready to future

OK waynechen55, I think it's better if I don't close this ticket, as I did not confirm whether any more machines work now. For all the physical machines behind the worker instances you mentioned, I suggest you check their status yourself and report issues to EngInfra, e.g. by mail to infra@suse.de or via https://infra.nue.suse.com/

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls has all the information that I am aware of and can help you track down these machines. I don't know more about them either :)

#5 Updated by waynechen55 2 months ago

Let's take the following two as an example:
grenache-1:39 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20

grenache-1:40 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20

I double-checked that the two SUTs behind these two workers are functioning normally and can be reached by ipmitool, so there is nothing wrong with the SUTs. The problem lies with the worker and the worker machine. Do you think I still need to contact infra? I think this can be fixed on the worker/worker machine side.

#7 Updated by okurz 2 months ago

  • Copied to action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) added

#9 Updated by okurz 2 months ago

  • Status changed from Feedback to Blocked
  • Assignee changed from waynechen55 to okurz
  • Target version changed from future to Ready

Right, no problems with machines behind the workers. Thank you for checking.

I checked on grenache-1 with systemctl status openqa-worker@39 which says:

Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds

The problem seems to be that baremetal-support.qa.suse.de is down. Created #78396 for that specific issue.
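To see at a glance which web UI hosts a worker fails to register with, the log excerpt above can be summarized with a small script. This is only a sketch: the regular expression is written against the four journal lines quoted above, and `failed_registrations` is a helper name made up for illustration, not part of openQA.

```python
import re
from collections import Counter

# Matches the "[warn] ... Failed to register at <host> - connection error: <reason> - trying again ..."
# lines as they appear in the excerpt above (assumed to be representative of the format).
FAILED = re.compile(
    r"Failed to register at (?P<host>\S+) - connection error: (?P<reason>.+?) - trying again"
)

def failed_registrations(log_text: str) -> Counter:
    """Count registration failures per (host, reason) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            counts[(match["host"], match["reason"])] += 1
    return counts

# The four lines quoted in this comment.
SAMPLE = """\
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
"""

for (host, reason), count in failed_registrations(SAMPLE).items():
    print(f"{host}: {count} failure(s): {reason}")
```

The same function could be fed the output of `journalctl -u openqa-worker@39` instead of the sample text.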

But the openQA worker should fall back to openqa.suse.de when multiple web UI hosts are configured. This seems to be a regression, which I recorded in #78390.
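The fallback behavior expected here can be sketched as follows. This is not openQA's actual worker code, only an illustration of the idea that a failure to reach one configured host should not keep the worker from serving the others; `register_with_fallback` and `fake_register` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def register_with_fallback(
    hosts: Iterable[str], register: Callable[[str], None]
) -> Tuple[List[str], Dict[str, str]]:
    """Try every configured web UI host; one failing host must not block the rest."""
    registered: List[str] = []
    failed: Dict[str, str] = {}
    for host in hosts:
        try:
            register(host)
        except OSError as error:
            # Remember the failure (it could be retried later) and carry on.
            failed[host] = str(error)
        else:
            registered.append(host)
    return registered, failed

def fake_register(host: str) -> None:
    # Hypothetical stand-in for the real registration call: it simulates
    # the unresolvable host from this ticket.
    if host == "baremetal-support.qa.suse.de":
        raise OSError("Name or service not known")

registered, failed = register_with_fallback(
    ["openqa.suse.de", "baremetal-support.qa.suse.de"], fake_register
)
print("registered:", registered)
print("failed:", failed)
```

Under this model the worker would only count as broken if `registered` ended up empty.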

I have temporarily removed baremetal-support.qa.suse.de from grenache-1:/etc/openqa/workers.ini and all jobs are running fine right now.
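For reference, the temporary change amounts to dropping one entry from the list of web UI hosts. The sketch below assumes the hosts are listed whitespace-separated under a `HOST` key in the `[global]` section of workers.ini; the real content of the file on grenache-1 is not shown in this ticket, so `EXAMPLE` is made up and `drop_host` is a hypothetical helper.

```python
import configparser
import io

def drop_host(ini_text: str, host: str) -> str:
    """Return the INI text with `host` removed from the [global] HOST list."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as HOST in upper case
    parser.read_string(ini_text)
    hosts = parser["global"]["HOST"].split()
    parser["global"]["HOST"] = " ".join(h for h in hosts if h != host)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()

# Made-up example content, not the actual file from grenache-1.
EXAMPLE = """\
[global]
HOST = openqa.suse.de baremetal-support.qa.suse.de
WORKER_CLASS = 64bit-ipmi
"""

print(drop_host(EXAMPLE, "baremetal-support.qa.suse.de"))
```

A manual edit like this is likely to be overwritten by the next salt run, which is why the permanent change goes into the salt pillars.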

I have created and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/278 to exclude the host baremetal-support.qa.suse.de while it is down, waiting for #78396

#10 Updated by waynechen55 2 months ago

Thanks very much for your help. I will double check again later.

#11 Updated by waynechen55 about 2 months ago

Are these four workers shown on openqa.suse.de obsolete? They are still offline.

openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-sriov,openqaworker2 x86_64 Offline 1 20

openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20

openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20

openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20

#12 Updated by okurz about 2 months ago

waynechen55 wrote:

Are these four shown up in openqa.suse.de obsolete ? They are still offline.

They are not obsolete, but they currently show as offline because the worker services show up under the wrong name. If you search for "64bit-ipmi-large-mem" you can actually find "linux-1nn1:21" through "linux-1nn1:24" running fine, but under the wrong hostname. These are known issues, recorded in #76786 and #75445.
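Searching by worker class rather than by hostname can also be done on a copy of the worker table. A minimal sketch, assuming rows in the whitespace-separated layout pasted earlier in this ticket (worker, host, classes, arch, status, then two further columns); the `linux-1nn1:22` row is hypothetical, since this ticket does not include it.

```python
from typing import Iterable, List, NamedTuple

class Worker(NamedTuple):
    name: str
    host: str
    classes: List[str]
    arch: str
    status: str

def parse_row(row: str) -> Worker:
    """Parse one whitespace-separated row; trailing columns are ignored."""
    name, host, classes, arch, status = row.split()[:5]
    return Worker(name, host, classes.split(","), arch, status)

def with_class(rows: Iterable[str], worker_class: str) -> List[Worker]:
    """Return all workers that advertise the given worker class."""
    return [w for w in map(parse_row, rows) if worker_class in w.classes]

ROWS = [
    # Copied from this ticket:
    "openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20",
    "grenache-1:16 grenache-1 64bit-ipmi,grenache-1 ppc64le Broken 1 20",
    # Hypothetical row for the same service registered under the wrong hostname:
    "linux-1nn1:22 linux-1nn1 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,linux-1nn1 x86_64 Working 1 20",
]

for worker in with_class(ROWS, "64bit-ipmi-large-mem"):
    print(worker.name, worker.status)
```

Matching on the class makes the stale entry and the live one under the other hostname show up side by side.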

#13 Updated by waynechen55 about 2 months ago

Thanks.

#14 Updated by okurz about 2 months ago

  • Estimated time set to 80142.00 h

#15 Updated by okurz about 2 months ago

  • Estimated time deleted (80142.00 h)

#16 Updated by okurz about 2 months ago

  • Parent task set to #80142

#17 Updated by okurz about 2 months ago

  • Copied to action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup added

#18 Updated by okurz about 2 months ago

  • Status changed from Blocked to Resolved

We could fix #76786, #75445 and also #78438 to clean up all old, misleading entries in the worker table.

I checked openqaworker2:21 through openqaworker2:24 and they were fine.

For the machine qa-power8-5-kvm we have #80482, and for powerqaworker-qam-1 we have #68053.
