action #120025

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

[openQA][ipmi][worker] Worker host hostname changed and broken networking connection

Added by waynechen55 almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: okurz
Category: -
Target version: Ready
Start date: 2022-11-07
Due date:
% Done: 0%
Estimated time:

Description

Observation

All virtualization workers used to have names like grenache-1:x or openqaworker-2:y, but the openQA IPMI workers page now shows irregular worker names of the form f83:xxx (see the attached screenshot Selection_103.png).

It looks to me as if something is wrong with the worker host itself. I also double-checked the salt pillar repo on GitLab, which does not contain any worker host named f83.
All tests that run on these f83:xxx workers fail, for example failure1 and failure2.
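Because every worker name has the form host:instance, names with an unexpected host part can be spotted mechanically. A minimal shell sketch, assuming you already have the worker names (for example copied from the workers page) and pass in the set of expected host names yourself; `find_unknown_workers` is a hypothetical helper, and the host list in the example is illustrative only (the authoritative list lives in openqa/workerconf.sls in the salt pillar repo):

```shell
#!/usr/bin/env bash
# Sketch: flag openQA worker names ("host:instance") whose host part is not
# among the expected worker hosts. The expected list is supplied by the
# caller; it is NOT read from the salt pillar repo here.

# Usage: find_unknown_workers "host1 host2 ..." < worker_names.txt
# Prints every worker name read from stdin whose host part is unknown.
find_unknown_workers() {
    local known="$1"   # space-separated list of expected host names
    awk -v known="$known" '
        BEGIN {
            n = split(known, hosts, " ")
            for (i = 1; i <= n; i++) ok[hosts[i]] = 1
        }
        NF == 0 { next }                  # ignore blank lines
        {
            host = $0
            sub(/:.*/, "", host)          # strip the ":instance" suffix
            if (!(host in ok)) print $0   # unknown host, e.g. a temporary name
        }'
}

# Example with a hypothetical host list:
#   printf '%s\n' grenache-1:10 f83:3 openqaworker2:5 \
#       | find_unknown_workers "grenache-1 openqaworker2"
#   -> f83:3
```

If your openQA instance exposes the worker list through its API (`/api/v1/workers`), the names could be fed in from there instead; the exact JSON field names are an assumption you would need to check.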

Steps to reproduce

  • Navigate to the openQA workers page and filter for the IPMI workers.
  • Run a test on one of these workers.

Impact

All tests that run on these workers fail.

Problem

Something seems to be wrong with the worker host itself.

Suggestion

  • Check the networking environment of the worker host
  • Check the IPMI worker configuration on the worker host
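For the second suggestion, one concrete value to check is WORKER_HOSTNAME, the setting that was eventually corrected on worker2. A minimal sketch, assuming the setting sits in the [global] section of an INI-style worker config such as /etc/openqa/workers.ini; `get_worker_hostname` and `check_worker_hostname` are hypothetical helpers, and the parser only understands simple `KEY = value` lines:

```shell
#!/usr/bin/env bash
# Sketch: read WORKER_HOSTNAME from the [global] section of an openQA worker
# config file (e.g. /etc/openqa/workers.ini) and compare it with the expected
# host name. Only simple "KEY = value" lines are understood; this is not a
# full INI parser.

# Usage: get_worker_hostname FILE
# Prints the WORKER_HOSTNAME value from [global], or nothing if it is unset.
get_worker_hostname() {
    awk '
        /^[ \t]*\[/ {                       # section header: track [global]
            in_global = ($0 ~ /^[ \t]*\[global][ \t]*$/)
            next
        }
        in_global && /^[ \t]*WORKER_HOSTNAME[ \t]*=/ {
            val = $0
            sub(/^[^=]*=[ \t]*/, "", val)  # drop the key and "="
            sub(/[ \t]*$/, "", val)        # drop trailing whitespace
            print val
            exit
        }' "$1"
}

# Usage: check_worker_hostname FILE EXPECTED
# Returns 0 if they match; otherwise reports the mismatch on stderr and returns 1.
check_worker_hostname() {
    local actual
    actual="$(get_worker_hostname "$1")"
    if [ "$actual" != "$2" ]; then
        echo "WORKER_HOSTNAME is '${actual:-<unset>}', expected '$2'" >&2
        return 1
    fi
}
```

A non-zero exit from `check_worker_hostname` together with its message shows what the worker is actually announcing, which would reveal a leftover temporary value.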

Workaround

n/a


Files

Selection_103.png (77.4 KB) Selection_103.png waynechen55, 2022-11-07 11:50
ipmi_worker_01.png (85.1 KB) ipmi_worker_01.png waynechen55, 2022-11-08 01:50
ipmi_worker_02.png (81 KB) ipmi_worker_02.png waynechen55, 2022-11-08 01:51

Related issues: 1 (0 open, 1 closed)

Related to QA - action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M (Resolved, okurz, 2022-11-17)

Actions #1

Updated by waynechen55 almost 2 years ago

We are currently in the Beta1 testing phase, so these workers are crucial.

Actions #2

Updated by dzedro almost 2 years ago

Not sure if the s390x failures are also related to the hostname change. https://openqa.suse.de/tests/9890767

Actions #5

Updated by okurz almost 2 years ago

  • Related to action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M added
Actions #6

Updated by okurz almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready
  • Parent task set to #116623

Working on that, related to #119443. I posted the following in the Slack channel #discuss-qe-new-security-zones:

@Lazaros Haleplidis https://progress.opensuse.org/issues/120025 mentions problems with openQA tests failing to access our bare-metal test hosts, i.e. all hosts listed in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls. One example: sp.fozzie.qa.suse.de over IPMI.

f83 was the temporary name of openqaworker2 while it was being migrated into a new network security zone, during which it had no proper hostname for some time.

Actions #7

Updated by okurz almost 2 years ago

  • Status changed from In Progress to Resolved

f83 was the temporary name of openqaworker2 while it was being migrated into a new network security zone, during which it had no proper hostname for some time. f83 will not mess with tests anymore.

dzedro wrote:

Not sure if the s390x failures are also related to the hostname change. https://openqa.suse.de/tests/9890767

Yes, same problem. Thank you for bringing this up and finding the right issue report :)

I have now fixed the WORKER_HOSTNAME config on worker2 and retriggered all affected tests with:

WORKER=worker2 result="result='failed'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=worker2 result="result='incomplete'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=f83 result="result='failed'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
WORKER=f83 result="result='incomplete'" failed_since=2022-11-07 host=openqa.suse.de openqa-advanced-retrigger-jobs | tee -a worker2_restart_$(date +%F).log
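The four commands differ only in the worker name and the result filter, so they can also be written as one loop. A sketch, assuming openqa-advanced-retrigger-jobs is on PATH and reads WORKER, result, failed_since and host from the environment exactly as in the commands above; `retrigger_worker_jobs`, RETRIGGER_CMD and LOG are hypothetical names added here for dry runs and testing:

```shell
#!/usr/bin/env bash
# Sketch: the four retrigger commands as a loop over worker names and results.
# RETRIGGER_CMD and LOG are hypothetical overrides; by default the same
# command and log file name as in the original commands are used.

retrigger_worker_jobs() {
    local cmd="${RETRIGGER_CMD:-openqa-advanced-retrigger-jobs}"
    local log="${LOG:-worker2_restart_$(date +%F).log}"
    local worker res
    # Cover both the corrected name (worker2) and the temporary one (f83).
    for worker in worker2 f83; do
        for res in failed incomplete; do
            # Parameters are passed via the environment, as in the commands above.
            WORKER="$worker" result="result='$res'" failed_since=2022-11-07 \
                host=openqa.suse.de "$cmd" | tee -a "$log"
        done
    done
}
```

Setting RETRIGGER_CMD to a harmless command such as `env` prints the environment each call would receive instead of retriggering anything.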

I am monitoring the tests as part of #119443.
