action #120025

[openQA][ipmi][worker] Worker host hostname changed and broken networking connection

Added by waynechen55 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-11-07
Due date:
% Done: 0%
Estimated time:

Description

Observation

All virtualization workers used to have names like grenache-1:x or openqaworker-2:y, but the openQA ipmi worker page currently shows irregular ipmi worker names of the form f83:xxx, as shown below:

It looks to me as if something is wrong with the worker host itself. I also double-checked the salt pillar repo on gitlab, which does not contain a worker host named f83.
All tests that run on these f83:xxx workers will fail, for example, failure1 and failure2.

Steps to reproduce

  • Navigate to the openQA workers page and filter for the ipmi workers.
  • Run a test on such a worker.

Impact

All tests run with these workers will definitely fail.

Problem

It seems there is something wrong with worker host itself.

Suggestion

  • Check worker host networking environment
  • Check ipmi workers config on worker host
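
One way to carry out the config check suggested above is to compare the WORKER_HOSTNAME value in the worker config with the name the host actually reports. The following is only a minimal sketch: the config path /etc/openqa/workers.ini and the helper names configured_worker_hostname and check_worker_hostname are assumptions for illustration, not something taken from this ticket.

```shell
#!/bin/sh
# Sketch only: helper names and the config path used below are assumptions.

# Print the value of the first WORKER_HOSTNAME entry found in the ini file $1.
configured_worker_hostname() {
    sed -n 's/^[[:space:]]*WORKER_HOSTNAME[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Compare the configured WORKER_HOSTNAME in file $1 with the expected name $2.
# Returns 0 on match, 1 on mismatch, 2 if the setting is absent.
check_worker_hostname() {
    conf=$1
    expected=$2
    configured=$(configured_worker_hostname "$conf")
    if [ -z "$configured" ]; then
        echo "WORKER_HOSTNAME is not set in $conf"
        return 2
    fi
    if [ "$configured" = "$expected" ]; then
        echo "OK: WORKER_HOSTNAME is $configured"
        return 0
    fi
    echo "MISMATCH: WORKER_HOSTNAME is $configured, expected $expected"
    return 1
}
```

A possible invocation on the worker host would be `check_worker_hostname /etc/openqa/workers.ini "$(hostname -f)"`. A mismatch is only a hint, since WORKER_HOSTNAME may deliberately be set to an address or alias that differs from the system hostname.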

Workaround

n/a

Selection_103.png (77.4 KB) Selection_103.png waynechen55, 2022-11-07 11:50
ipmi_worker_01.png (85.1 KB) ipmi_worker_01.png waynechen55, 2022-11-08 01:50
ipmi_worker_02.png (81 KB) ipmi_worker_02.png waynechen55, 2022-11-08 01:51

History

#1 Updated by waynechen55 3 months ago

We are currently in the Beta1 testing phase, so these workers are crucial.

#2 Updated by dzedro 3 months ago

Not sure if the s390x failures are also related to the hostname change. https://openqa.suse.de/tests/9890767

#6 Updated by okurz 3 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready
  • Parent task set to #116623

Working on that, related to #119443. I mentioned this in the Slack channel #discuss-qe-new-security-zones:

@Lazaros Haleplidis https://progress.opensuse.org/issues/120025 mentions problems of openQA tests failing to access our bare metal test hosts, i.e. all hosts mentioned in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls. Just one example: sp.fozzie.qa.suse.de over IPMI

f83 was a temporary name of openqaworker2 while it was being migrated into a new network security zone; it was without a proper hostname for some time.

#7 Updated by okurz 3 months ago

  • Status changed from In Progress to Resolved

f83 was a temporary name of openqaworker2 while it was being migrated into a new network security zone; it was without a proper hostname for some time. f83 will not interfere with tests anymore.

dzedro wrote:

Not sure if the s390x failures are also related to the hostname change. https://openqa.suse.de/tests/9890767

Yes, same problem. Thank you for bringing this up and finding the right issue report :)

I have now fixed the config for WORKER_HOSTNAME on worker2 and retriggered all affected tests with:

WORKER=worker2 result="result='failed'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=worker2 result="result='incomplete'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=f83 result="result='failed'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=f83 result="result='incomplete'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
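
The four invocations above differ only in the worker name and the result filter, so they could also be generated from a loop. This is a hedged sketch, not what was actually run: the function name retrigger_commands is hypothetical, and it only prints the command lines (without the `tee` logging) so they can be reviewed before execution.

```shell
#!/bin/sh
# Sketch only: retrigger_commands is a hypothetical helper. It prints one
# command line per worker name and result filter and executes nothing.
# Usage: retrigger_commands <failed_since> <worker>...
retrigger_commands() {
    since=$1
    shift
    for worker in "$@"; do
        for res in failed incomplete; do
            printf "WORKER=%s result=\"result='%s'\" failed_since=%s host=openqa.suse.de openqa-advanced-retrigger-jobs\n" \
                "$worker" "$res" "$since"
        done
    done
}
```

Calling `retrigger_commands 2022-11-07 worker2 f83` prints the same four command lines as above, which can then be reviewed and piped into `sh` if they look right.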

I am monitoring tests as part of #119443
