action #78218
openQA Project - coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[openQA][worker] Almost all openQA workers become offline
Description
Almost all openQA workers have gone offline, so many openQA jobs have stopped running.
Please refer to https://openqa.suse.de/admin/workers.
Updated by waynechen55 almost 4 years ago
These are still broken:
grenache-1:10 grenache-1 64bit-ipmi,64bit-ipmi-large-mem,grenache-1 ppc64le Broken 1 20
grenache-1:16 grenache-1 64bit-ipmi,grenache-1 ppc64le Broken 1 20
grenache-1:17 grenache-1 virt-pvusb-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,grenache-1 ppc64le Broken 1 20
grenache-1:39 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
grenache-1:40 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Working 1 20
Updated by okurz almost 4 years ago
- Related to coordination #78206: [epic] 2020-11-18 nbg power outage aftermath added
Updated by okurz almost 4 years ago
- Status changed from New to Resolved
- Assignee set to okurz
- Target version set to Ready
I tried to fix most of them yesterday evening, as described in more detail in #78206. Others have fixed more today. Any "virtualization jump hosts" and non-qemu based worker host addendums might still be down, though; those are out of scope for us as a team.
Updated by okurz almost 4 years ago
- Status changed from Resolved to Feedback
- Assignee changed from okurz to waynechen55
- Target version changed from Ready to future
OK, @waynechen55, I think it's better if I don't close this ticket, as I did not confirm whether any more machines work now. For all the physical machines behind the worker instances that you mentioned, I suggest you check their status yourself and report issues to EngInfra, e.g. by mail to infra@suse.de or at https://infra.nue.suse.com/
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls has all the information that I am aware of and can help you track down these machines. I don't know more about them either :)
Updated by waynechen55 almost 4 years ago
okurz wrote:
ok, @waynechen55 I think it's better if I don't close this ticket as I did not confirm if any more machines work now. For all the physical machines that are behind the worker instances that you mentioned I suggest you check yourself what is the status and report issues to EngInfra, e.g. mail to infra@suse.de or report on https://infra.nue.suse.com/
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls has all information that I am aware of and can you help to track down these machines. I also don't know more about them :)
Let's take the following two as example:
grenache-1:39 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
grenache-1:40 grenache-1 virt-arm-64bit-ipmi,arm-64bit-ipmi,grenache-1 ppc64le Broken 1 20
I double-checked that the two SUTs behind these two workers are functioning normally and can be reached with ipmitool, so there is nothing wrong with the SUTs. The problem lies with the worker and the worker machine. Do you think I still need to contact infra? I think this can be fixed on the worker/worker machine side.
Updated by okurz almost 4 years ago
- Copied to action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) added
Updated by okurz almost 4 years ago
- Status changed from Feedback to Blocked
- Assignee changed from waynechen55 to okurz
- Target version changed from future to Ready
Right, no problems with machines behind the workers. Thank you for checking.
I checked on grenache-1 with systemctl status openqa-worker@39
which says:
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
The problem seems to be that baremetal-support.qa.suse.de is down. Created #78396 for that specific issue.
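To see at a glance which web UI host a worker keeps failing to register with, the journal lines can be summarized with a short script. This is only a minimal sketch: the regular expression is derived from the four log lines quoted above, and the function name `failed_registrations` is made up for illustration, not part of openQA.

```python
import re
from collections import Counter

# Matches the "[warn] ... Failed to register at <host> - connection error:
# <reason> - trying again in N seconds" lines as they appear in the quoted log.
FAILED_REGISTRATION = re.compile(
    r"Failed to register at (?P<host>\S+) - connection error: "
    r"(?P<reason>.+?) - trying again"
)


def failed_registrations(log_lines):
    """Count failed registration attempts per (host, reason) pair."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_REGISTRATION.search(line)
        if match:
            counts[(match["host"], match["reason"])] += 1
    return counts


# The four lines from `systemctl status openqa-worker@39` quoted above.
SAMPLE = [
    "Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de",
    "Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds",
    "Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de",
    "Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds",
]

if __name__ == "__main__":
    for (host, reason), count in failed_registrations(SAMPLE).items():
        print(f"{host}: {count} failed attempts ({reason})")
```

On a worker host, the same function could be fed the output of `journalctl -u openqa-worker@39` instead of the sample list.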
But the openQA worker should fall back to openqa.suse.de when multiple web UI hosts are configured. This seems to be a regression, which I recorded in #78390.
I have temporarily removed baremetal-support.qa.suse.de from grenache-1:/etc/openqa/workers.ini and all jobs are running fine right now.
I have created and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/278 to exclude the host baremetal-support.qa.suse.de while it is down, waiting for #78396
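The temporary change to workers.ini can be pictured roughly as follows. This is a sketch under assumptions, not the actual change from the merge request: it assumes the web UI hosts are listed space-separated in a `HOST` key of the `[global]` section, and the helper `remove_host` is hypothetical. On salt-managed workers the file is generated from the pillar data, so a manual edit like this would only hold until the next salt run.

```python
import configparser
import io


def remove_host(ini_text, host):
    """Return workers.ini content with `host` dropped from [global] HOST.

    Assumes HOST holds a space-separated list of web UI hosts.
    Note that configparser does not preserve comments in the file.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "HOST" instead of "host"
    parser.read_string(ini_text)
    hosts = parser["global"]["HOST"].split()
    parser["global"]["HOST"] = " ".join(h for h in hosts if h != host)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Illustrative content only; the real file on grenache-1 has more settings.
SAMPLE_INI = "[global]\nHOST = openqa.suse.de baremetal-support.qa.suse.de\n"

if __name__ == "__main__":
    print(remove_host(SAMPLE_INI, "baremetal-support.qa.suse.de"))
```

After such a change the worker services need a restart to pick up the new host list.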
Updated by waynechen55 almost 4 years ago
okurz wrote:
Right, no problems with machines behind the workers. Thank you for checking.
I checked on grenache-1 with
systemctl status openqa-worker@39
which says:
Nov 20 06:40:46 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:46 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Nov 20 06:40:56 grenache-1 worker[6888]: [info] [pid:6888] Registering with openQA baremetal-support.qa.suse.de
Nov 20 06:40:56 grenache-1 worker[6888]: [warn] [pid:6888] Failed to register at baremetal-support.qa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
The problem seems to be that baremetal-support.qa.suse.de is down. Created #78396 for that specific issue.
But the openQA worker should fall back to openqa.suse.de when multiple workers are configured. This seems to be a regression which I recorded in #78390
I have temporarily removed baremetal-support.qa.suse.de from grenache-1:/etc/openqa/workers.ini and all jobs are running fine right now.
I have created and merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/278 to exclude the host baremetal-support.qa.suse.de while it is down, waiting for #78396
Thanks very much for your help. I will double check again later.
Updated by waynechen55 almost 4 years ago
Are these four workers shown on openqa.suse.de obsolete? They are still offline.
openqaworker2:21 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-sriov,openqaworker2 x86_64 Offline 1 20
openqaworker2:22 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
openqaworker2:23 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
openqaworker2:24 openqaworker2 virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2 x86_64 Offline 1 20
Updated by okurz almost 4 years ago
waynechen55 wrote:
Are these four shown up in openqa.suse.de obsolete ? They are still offline.
They are not obsolete but currently show as offline because the worker services show up under the wrong name. If you search for "64bit-ipmi-large-mem" you can actually find "linux-1nn1:21" through "linux-1nn1:24" running fine, but under the wrong hostname. These are known issues, recorded in #76786 and #75445.
Updated by waynechen55 almost 4 years ago
okurz wrote:
waynechen55 wrote:
Are these four shown up in openqa.suse.de obsolete ? They are still offline.
they are not obsolete but currently show as offline because the worker service show up under the wrong name. If you search for "64bit-ipmi-large-mem" you can actually find "linux-1nn1:21" through "linux-1nn1:24" running fine but under the wrong hostname. These are known issues, recorded in #76786 and #75445
Thanks.
Updated by okurz almost 4 years ago
- Copied to action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup added
Updated by okurz almost 4 years ago
- Status changed from Blocked to Resolved