action #78218: [openQA][worker] Almost all openQA workers become offline - openQA Infrastructure (public) - openSUSE Project Management Tool

action #78218

closed

openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[openQA][worker] Almost all openQA workers become offline

Added by waynechen55 over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee: okurz
Category:
Target version: openQA Project (public) - Ready
Start date: 2020-11-19
Due date:
% Done:
Estimated time:

Description

Almost all openQA workers have gone offline, so many openQA jobs have stopped running.

Please refer to https://openqa.suse.de/admin/workers.

Related issues 3 (0 open — 3 closed)

Updated by waynechen55 over 4 years ago

These are still broken:

grenache-1:10 grenache-1 64bit-ipmi,64bit-ipmi-large-mem,grenache-1 ppc64le Broken 1 20
grenache-1:16 grenache-1 64bit-ipmi,grenache-1 ppc64le Broken 1 20
grenache-1:17 grenache-1 virt-pvusb-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,grenache-1 ppc64le Broken 1 20
grenache-1:39 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
grenache-1:40 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20

Updated by okurz over 4 years ago

Related to coordination #78206: [epic] 2020-11-18 nbg power outage aftermath added

Updated by okurz over 4 years ago

Status changed from New to Resolved
Assignee set to okurz
Target version set to Ready

I tried to fix most of them yesterday evening, as described in more detail in #78206. Others have fixed more today. Any "virtualization jump hosts" and non-qemu based worker host addendums might still be down, but those are out of scope for us as a team.

Updated by okurz over 4 years ago

Status changed from Resolved to Feedback
Assignee changed from okurz to waynechen55
Target version changed from Ready to future

OK, @waynechen55, I think it's better if I don't close this ticket, as I did not confirm whether any more machines work now. For all the physical machines behind the worker instances you mentioned, I suggest you check their status yourself and report issues to EngInfra, e.g. by mail to infra@suse.de or on https://infra.nue.suse.com/

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls has all the information that I am aware of and can help you track down these machines. I don't know more about them either :)

Updated by waynechen55 over 4 years ago

Let's take the following two as an example:
grenache-1:39 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
grenache-1:40 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20

I double-checked that the two SUTs behind these two workers are functioning normally and can be reached with ipmitool, so there is nothing wrong with the SUTs. The problem lies with the worker or the worker machine. Do you think I still need to contact infra? I think this can be fixed on the worker/worker machine side.

Updated by okurz over 4 years ago

Copied to action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) added

Updated by okurz over 4 years ago

Status changed from Feedback to Blocked
Assignee changed from waynechen55 to okurz
Target version changed from future to Ready

Right, no problems with machines behind the workers. Thank you for checking.

I checked on grenache-1 with systemctl status openqa-worker@39 which says:

Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds

The problem seems to be that baremetal-support.qa.suse.de is down. Created #78396 for that specific issue.
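
To see at a glance which web UI host a worker keeps failing to register with, the journal output can be summarized with a small script. This is only a sketch: the regular expression is derived from the four log lines quoted in this comment, and the names `FAILED_REGISTRATION` and `registration_failures` are made up for illustration, not part of openQA.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this ticket (an assumption,
# not an official openQA log format specification).
FAILED_REGISTRATION = re.compile(
    r"\[warn\] \[pid:(?P<pid>\d+)\] Failed to register at (?P<host>\S+) - "
    r"connection error: (?P<error>.+?) - trying again in (?P<delay>\d+) seconds"
)

# The journal excerpt from this ticket, copied verbatim.
SAMPLE_LOG = """\
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
"""


def registration_failures(log_text):
    """Count failed registrations per (host, error message) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_REGISTRATION.search(line)
        if match:
            counts[(match["host"], match["error"])] += 1
    return counts


if __name__ == "__main__":
    for (host, error), count in registration_failures(SAMPLE_LOG).items():
        print(f"{host}: {count} failed attempts ({error})")
```

Note that "Name or service not known" is the resolver's message for a hostname that cannot be looked up, so the excerpt points at name resolution for baremetal-support.qa.suse.de rather than at the SUTs.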

But the openQA worker should fall back to openqa.suse.de when multiple web UI hosts are configured. This seems to be a regression, which I recorded in #78390.

I have temporarily removed baremetal-support.qa.suse.de from grenache-1:/etc/openqa/workers.ini and all jobs are running fine right now.

I have created and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/278 to exclude the host baremetal-support.qa.suse.de while it is down, waiting for #78396
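
The manual workaround above (dropping the unreachable host from workers.ini) can be preceded by a quick check of which configured hosts resolve at all. The following is a minimal sketch, assuming workers.ini has a `[global]` section with a whitespace-separated `HOST` entry; the helper names and the sample configuration are hypothetical and only illustrate the idea. The resolver is injectable so the logic can be tried without network access.

```python
import configparser
import socket
from urllib.parse import urlparse

# Hypothetical sample; the real file is /etc/openqa/workers.ini on the worker host.
SAMPLE_INI = """
[global]
HOST = openqa.suse.de baremetal-support.qa.suse.de
"""


def configured_hosts(ini_text):
    """Return the HOST entries of the [global] section as a list."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get("global", "HOST", fallback="").split()


def hostname_of(entry):
    """Accept bare hostnames as well as full URLs such as http://host."""
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    return parsed.hostname or entry


def split_by_resolution(ini_text, resolver=socket.getaddrinfo):
    """Split configured hosts into (resolvable, unresolvable) lists."""
    resolvable, unresolvable = [], []
    for entry in configured_hosts(ini_text):
        try:
            resolver(hostname_of(entry), None)
            resolvable.append(entry)
        except socket.gaierror:
            # Raised for "Name or service not known" and similar lookup errors.
            unresolvable.append(entry)
    return resolvable, unresolvable


if __name__ == "__main__":
    ok, bad = split_by_resolution(SAMPLE_INI)
    print("resolvable:", ok)
    print("unresolvable:", bad)
```

A host that resolves can still be down, so this only catches the name-lookup failure seen in the log above, not every kind of outage.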

#10

Updated by waynechen55 over 4 years ago

Thanks very much for your help. I will double check again later.

#11

Updated by waynechen55 over 4 years ago

Are these four workers shown on openqa.suse.de obsolete? They are still offline.

openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-sriov,openqaworker2 x86_64 Offline 1 20
openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20

#12

Updated by okurz over 4 years ago

waynechen55 wrote:

Are these four workers shown on openqa.suse.de obsolete? They are still offline.

They are not obsolete but currently show as offline because the worker services show up under the wrong hostname. If you search for "64bit-ipmi-large-mem" you can actually find "linux-1nn1:21" through "linux-1nn1:24" running fine, but under the wrong hostname. These are known issues, recorded in #76786 and #75445.
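
Searching by worker class instead of by hostname, as suggested above, can also be done on a copied worker listing. This is a rough sketch: the column order (instance, host, comma-separated classes, architecture, status) is inferred from the rows pasted in this ticket, the trailing numeric columns are ignored because their meaning is not stated here, and `WorkerRow`, `parse_row` and `instances_with_class` are illustrative names.

```python
from typing import NamedTuple

# Rows copied verbatim from the question in this ticket.
LISTING = """\
openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-sriov,openqaworker2 x86_64 Offline 1 20
openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
"""


class WorkerRow(NamedTuple):
    instance: str
    host: str
    classes: frozenset
    arch: str
    status: str


def parse_row(line):
    """Parse one whitespace-separated row; the column order is an assumption."""
    fields = line.split()
    return WorkerRow(
        instance=fields[0],
        host=fields[1],
        classes=frozenset(fields[2].split(",")),
        arch=fields[3],
        status=fields[4],
    )


def instances_with_class(listing, worker_class):
    """Return the instances offering worker_class, whatever host they report."""
    rows = (parse_row(line) for line in listing.splitlines() if line.strip())
    return [row.instance for row in rows if worker_class in row.classes]


if __name__ == "__main__":
    print(instances_with_class(LISTING, "64bit-ipmi-large-mem"))
```

Run against the full worker list, the same filter would also surface instances registered under an unexpected hostname, since it matches on the class and not on the host column.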

#13

Updated by waynechen55 over 4 years ago

okurz wrote:

waynechen55 wrote:

Are these four workers shown on openqa.suse.de obsolete? They are still offline.

They are not obsolete but currently show as offline because the worker services show up under the wrong hostname. If you search for "64bit-ipmi-large-mem" you can actually find "linux-1nn1:21" through "linux-1nn1:24" running fine, but under the wrong hostname. These are known issues, recorded in #76786 and #75445.

Thanks.

#14