action #125810

closed

[openqa][infra] Some SUT machines can not upload logs to worker machine size:S

Added by waynechen55 over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-03-13
Due date: 2023-03-27
% Done: 0%
Estimated time:
Description

Observation

Some SUT machines, I think those added back by mr!506, cannot upload logs to the worker machine:
grenache-1:12/kermit
grenache-1:13/gonzo
grenache-1:15/scooter
grenache-1:19/amd-zen3

The upload does not succeed even after many retries.

Steps to reproduce

  • Run an openQA test on one of the above workers/machines and try to upload logs to the worker machine

Impact

  • All tests assigned to these workers/machines will fail
  • Tests with paired machines still cannot be run

Problem

Generally speaking, this looks like an environment issue.

Suggestion

  • Check firewall settings on worker machine DONE: not the problem
  • Check other security settings on worker machine DONE: not the problem
  • Check for limitations imposed by other elements en route, for example a jump/forwarding machine.
  • Open Eng-Infra SD ticket to configure firewall to open required port range (or just allow all traffic)
  • See "command server" box on https://open.qa/docs/images/architecture.svg for port range
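
As a rough aid for the port-range suggestion above, the following sketch computes the command server port for a worker instance and probes it from a SUT. It assumes the openQA convention that the command server listens on 20003 + 10 × worker instance (so instance 12 would map to 20123); please verify this against the architecture diagram before filing the firewall request. The function names and the host argument are illustrative only and not part of openQA.

```shell
#!/bin/bash
# Sketch only: the port formula is an assumption (20003 + 10 * worker instance).

# Print the assumed command server port for a worker instance number.
command_server_port() {
    local instance="$1"
    echo $((20003 + 10 * instance))
}

# Probe a TCP port via bash's /dev/tcp; returns 0 if a connection opens.
# Hypothetical helper: run it on the SUT against the worker host.
probe_port() {
    local host="$1" port="$2"
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print the assumed ports for the four affected worker instances in this ticket.
affected_ports() {
    local instance
    for instance in 12 13 15 19; do
        echo "grenache-1:${instance} -> $(command_server_port "$instance")"
    done
}
```

For example, running probe_port <worker-host> "$(command_server_port 12)" on kermit should succeed once the firewall allows the traffic and fail (or time out) while the port is blocked.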

Workaround

Rollback steps


Related issues 1 (0 open, 1 closed)

Related to QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M | Resolved | nicksinger | 2023-03-10

Actions #1

Updated by waynechen55 over 1 year ago

  • Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added
Actions #2

Updated by waynechen55 over 1 year ago

One more piece of information:
I encountered a similar issue on my personal openQA server, which was solved by turning firewalld off.

But for this specific issue, I am not sure which element/part is the culprit.

Actions #3

Updated by okurz over 1 year ago

  • Tags set to infra, firewall
  • Project changed from openQA Project to openQA Infrastructure
  • Subject changed from [openqa[infra] Some SUT macines an not upload logs to worker machine to [openqa][infra] Some SUT macines an not upload logs to worker machine
  • Target version set to Ready

I suspect it's firewall between the zones

Actions #4

Updated by waynechen55 over 1 year ago

okurz wrote:

I suspect it's firewall between the zones

Can this be fixed today? Do you need an SD-[TICKET]?

Actions #5

Updated by waynechen55 over 1 year ago

If this cannot be fixed soon, I think it is necessary to disable these workers; otherwise virtualization tests cannot be assigned to functioning workers as normal.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/508

@okurz

Actions #6

Updated by mgriessmeier over 1 year ago

waynechen55 wrote:

If this cannot be fixed soon, I think it is necessary to disable these workers; otherwise virtualization tests cannot be assigned to functioning workers as normal.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/508

@okurz

merged

Actions #7

Updated by livdywan over 1 year ago

  • Subject changed from [openqa][infra] Some SUT macines an not upload logs to worker machine to [openqa][infra] Some SUT macines an not upload logs to worker machine size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by okurz over 1 year ago

  • Tags deleted (infra, firewall)
  • Subject changed from [openqa][infra] Some SUT macines an not upload logs to worker machine size:S to [openqa][infra] Some SUT machines an not upload logs to worker machine size:S
  • Target version deleted (Ready)
Actions #9

Updated by okurz over 1 year ago

  • Tags set to infra, firewall
  • Target version set to Ready
Actions #10

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #11

Updated by okurz over 1 year ago

@waynechen55 I suggest using the usual ticket reporting template that you get when clicking the "report ticket" button from openQA, and keeping the "link to latest" so that we always have the most recent run of openQA jobs for affected scenarios.

Actions #12

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to nicksinger
Actions #13

Updated by nicksinger over 1 year ago

  • Status changed from Blocked to Feedback

Firewall was adjusted by infra. Re-triggered the mentioned tests to check if they can pass the upload stage:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667483 TEST=nsinger_investigation_ipmi_workers_poo125810_kermit BUILD=nsinger_investigation_ipmi_workers_poo125810_kermit _GROUP=0 WORKER_CLASS=kermit_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670636 TEST=nsinger_investigation_ipmi_workers_poo125810_gonzo BUILD=nsinger_investigation_ipmi_workers_poo125810_gonzo _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670612 TEST=nsinger_investigation_ipmi_workers_poo125810_scooter BUILD=nsinger_investigation_ipmi_workers_poo125810_scooter _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667508 TEST=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 BUILD=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 _GROUP=0 WORKER_CLASS=amd-zen3-gpu-sut1_poo125810

Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009
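
The four commands above follow one naming pattern, which could be generated roughly as sketched below. This only prints the commands rather than running them. It assumes each machine gets its own WORKER_CLASS of the form <machine>_poo125810; note that the scooter command as posted uses WORKER_CLASS=gonzo_poo125810, which this sketch does not reproduce and which may or may not have been intended. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: prints the clone commands instead of executing them.

# Print one openqa-clone-job command for a job URL and machine name.
# TEST/BUILD/WORKER_CLASS naming follows the pattern used in this comment.
clone_command() {
    local job_url="$1" machine="$2"
    local label="nsinger_investigation_ipmi_workers_poo125810_${machine}"
    echo "openqa-clone-job --skip-chained-deps --within-instance ${job_url}" \
        "TEST=${label} BUILD=${label} _GROUP=0 WORKER_CLASS=${machine}_poo125810"
}

# Print the commands for the job/machine pairs listed in this comment.
all_clone_commands() {
    local base="https://openqa.suse.de/tests"
    local job machine
    while read -r job machine; do
        clone_command "${base}/${job}" "$machine"
    done <<'EOF'
10667483 kermit
10670636 gonzo
10670612 scooter
10667508 amd-zen3-gpu-sut1
EOF
}
```

Calling all_clone_commands prints four lines that can be reviewed before being pasted into a shell.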

Actions #14

Updated by waynechen55 over 1 year ago

nicksinger wrote:

Firewall was adjusted by infra. Re-triggered the mentioned tests to check if they can pass the upload stage:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667483 TEST=nsinger_investigation_ipmi_workers_poo125810_kermit BUILD=nsinger_investigation_ipmi_workers_poo125810_kermit _GROUP=0 WORKER_CLASS=kermit_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670636 TEST=nsinger_investigation_ipmi_workers_poo125810_gonzo BUILD=nsinger_investigation_ipmi_workers_poo125810_gonzo _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670612 TEST=nsinger_investigation_ipmi_workers_poo125810_scooter BUILD=nsinger_investigation_ipmi_workers_poo125810_scooter _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667508 TEST=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 BUILD=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 _GROUP=0 WORKER_CLASS=amd-zen3-gpu-sut1_poo125810

Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009

Job #10689006 passed the previously failing step. Two of the others failed at other places.

nicksinger wrote:

Firewall was adjusted by infra. Re-triggered the mentioned tests to check if they can pass the upload stage:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667483 TEST=nsinger_investigation_ipmi_workers_poo125810_kermit BUILD=nsinger_investigation_ipmi_workers_poo125810_kermit _GROUP=0 WORKER_CLASS=kermit_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670636 TEST=nsinger_investigation_ipmi_workers_poo125810_gonzo BUILD=nsinger_investigation_ipmi_workers_poo125810_gonzo _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670612 TEST=nsinger_investigation_ipmi_workers_poo125810_scooter BUILD=nsinger_investigation_ipmi_workers_poo125810_scooter _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667508 TEST=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 BUILD=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 _GROUP=0 WORKER_CLASS=amd-zen3-gpu-sut1_poo125810

Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009

Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006 In progress. Passed the previously failing "update_package" step.
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010 Failed, but passed the previously failing "update_package" step.
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011 Failed, but passed the previously failing "update_package" step.
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009 Failed before reaching the previously failing "update_package" step, so the fix cannot be confirmed for this machine. But if this is a firewall issue, that might be acceptable given that all the others passed this step.

Actions #17

Updated by waynechen55 over 1 year ago

okurz wrote:

@waynechen55 I suggest using the usual ticket reporting template that you get when clicking the "report ticket" button from openQA, and keeping the "link to latest" so that we always have the most recent run of openQA jobs for affected scenarios.

No problem.

Actions #18

Updated by waynechen55 over 1 year ago

I think all four machines have the boot_from_pxe issue again:
grenache-1:12/kermit
grenache-1:13/gonzo
grenache-1:15/scooter
grenache-1:19/amd-zen3

Example: https://openqa.suse.de/tests/10692925#step/boot_from_pxe/6

Additionally, timeouts happen very frequently during test runs, for example https://openqa.suse.de/tests/10690784#step/unified_guest_installation/10

Actions #19

Updated by tonyyuan over 1 year ago

The PXE server is not working. See the screenshots at these links:
https://openqa.suse.de/tests/10691284
https://openqa.suse.de/tests/10691285

The configuration for the four machines on the DHCP server seems incorrect. For example, the machines cannot get a DNS server IP from DHCP. In addition, gonzo gets assigned a wrong IP address that does not match its DNS name.

Actions #20

Updated by okurz over 1 year ago

  • Due date set to 2023-03-27
Actions #21

Updated by okurz over 1 year ago

  • Status changed from Feedback to In Progress

On qa-jump I looked into journalctl | grep -i <mac_address_of_scooter> and found nothing related to PXE file deliveries, nor any error messages. @nicksinger I suggest finding a reproducer openQA job for scooter and any other affected machine, rerunning it, and monitoring what the PXE server on qa-jump is doing.
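
A slightly broader version of that journal search could look like the sketch below. It assumes the PXE setup on qa-jump uses pxelinux-style per-host config names (01- followed by the lowercase MAC with hyphens), so it matches the MAC in both colon and hyphen notation; whether qa-jump actually logs in either form is not confirmed here. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: filters log text read from stdin for one MAC address.

# Convert a MAC address to lowercase with hyphens, the form pxelinux
# uses in per-host config file names (prefixed with "01-").
mac_hyphenated() {
    echo "$1" | tr 'A-F:' 'a-f-'
}

# Keep only lines that mention the MAC in colon or hyphen notation.
filter_mac() {
    local mac_colon mac_hyphen
    mac_colon=$(echo "$1" | tr 'A-F-' 'a-f:')
    mac_hyphen=$(mac_hyphenated "$1")
    grep -i -e "$mac_colon" -e "$mac_hyphen"
}
```

While a reproducer job is running, something like journalctl -f | filter_mac <mac_address_of_scooter> would show DHCP and TFTP lines for that machine as they arrive, if any are logged.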

Actions #22

Updated by waynechen55 over 1 year ago

It seems these four machines are no longer configured in qanet-configs, which has no relevant entries for them:
kermit
gonzo
scooter
amd-zen3
So where are they?
@okurz @nicksinger

Actions #23

Updated by waynechen55 over 1 year ago

We need to continue our RC1 testing, so if there is no problem, please help merge this pull request: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/512

Actions #24

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Resolved

We discovered a bug in our PXE config generation which will hopefully be fixed by https://gitlab.suse.de/qa-sle/qa-jump-configs/-/merge_requests/3
I'm resolving this ticket now in favor of https://progress.opensuse.org/issues/119551 because the issue initially reported (machines cannot upload logs because of the firewall) has been resolved in the meantime.

Actions #25

Updated by waynechen55 over 1 year ago

nicksinger wrote:

We discovered a bug in our PXE config generation which will hopefully be fixed by https://gitlab.suse.de/qa-sle/qa-jump-configs/-/merge_requests/3
I'm resolving this ticket now in favor of https://progress.opensuse.org/issues/119551 because the issue initially reported (machines cannot upload logs because of the firewall) has been resolved in the meantime.

Thanks. Let's wait patiently for your result.

Actions #26

Updated by okurz over 1 year ago

waynechen55 wrote:

It seems these four machines are no longer configured in qanet-configs, which has no relevant entries for them:
kermit
gonzo
scooter
amd-zen3
So where are they?
@okurz @nicksinger

To answer that question as well, please see https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs which references https://gitlab.suse.de/OPS-Service/salt/, which includes the DHCP+DNS config for FC Basement.

Actions #27

Updated by tinita over 1 year ago

  • Subject changed from [openqa][infra] Some SUT machines an not upload logs to worker machine size:S to [openqa][infra] Some SUT machines can not upload logs to worker machine size:S