action #125810
closed[openqa][infra] Some SUT machines can not upload logs to worker machine size:S
0%
Description
Observation
Some SUT machines cannot upload logs to the worker machine. I think they are the ones added back by mr!506:
grenache-1:12/kermit
grenache-1:13/gonzo
grenache-1:15/scooter
grenache-1:19/amd-zen3
The upload does not succeed even after many retries.
Steps to reproduce
- Run an openQA test on one of the above workers/machines and try to upload logs to the worker machine
Impact
- All tests assigned to these workers/machines will fail
- Tests with paired machines still cannot be run
Problem
Generally speaking, this looks like an environment issue.
Suggestion
- Check firewall settings on worker machine. DONE: not the problem
- Check other security settings on worker machine. DONE: not the problem
- Check other limitations imposed by other elements en route, for example the jump/forward machine.
- Open Eng-Infra SD ticket to configure firewall to open required port range (or just allow all traffic)
- See "command server" box on https://open.qa/docs/images/architecture.svg for port range
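A quick way to test the firewall hypothesis is to attempt plain TCP connections from a machine in the SUT network to the worker. The following is a minimal Python sketch, not part of openQA; the function name `check_ports` is hypothetical, and the host and port values have to be taken from the worker in question and from the port range in the architecture diagram.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port and report which ones accept connections."""
    results = {}
    for port in ports:
        try:
            # create_connection performs the full TCP handshake, which is what
            # a firewall between network zones would block or drop.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and unresolvable hosts.
            results[port] = False
    return results
```

Calling `check_ports("<worker host>", range(<first port>, <last port> + 1))` from the SUT side returns a mapping of port to `True`/`False`. A port that is `False` only from the SUT network but `True` from the worker itself would point to something en route rather than to the worker.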
Workaround
- Drop workers from salt (done: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/508)
Rollback steps
Updated by waynechen55 over 1 year ago
- Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added
Updated by waynechen55 over 1 year ago
One more piece of information:
I encountered a similar issue on my personal openQA server, which was solved by turning firewalld off.
But for this specific issue, I am not sure which element/part is the culprit.
Updated by okurz over 1 year ago
- Tags set to infra, firewall
- Project changed from openQA Project to openQA Infrastructure
- Subject changed from [openqa[infra] Some SUT macines an not upload logs to worker machine to [openqa][infra] Some SUT macines an not upload logs to worker machine
- Target version set to Ready
I suspect it's the firewall between the zones
Updated by waynechen55 over 1 year ago
okurz wrote:
I suspect it's the firewall between the zones
Can this be fixed today? Do you need SD-[TICKET]?
Updated by waynechen55 over 1 year ago
If this cannot be fixed soon, I think it is necessary to disable these workers; otherwise virtualization tests cannot be assigned to functioning workers as normal.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/508
Updated by mgriessmeier over 1 year ago
waynechen55 wrote:
If this cannot be fixed soon, I think it is necessary to disable these workers; otherwise virtualization tests cannot be assigned to functioning workers as normal.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/508
merged
Updated by livdywan over 1 year ago
- Subject changed from [openqa][infra] Some SUT macines an not upload logs to worker machine to [openqa][infra] Some SUT macines an not upload logs to worker machine size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Tags deleted (infra, firewall)
- Subject changed from [openqa][infra] Some SUT macines an not upload logs to worker machine size:S to [openqa][infra] Some SUT machines an not upload logs to worker machine size:S
- Target version deleted (Ready)
Updated by okurz over 1 year ago
- Tags set to infra, firewall
- Target version set to Ready
Updated by okurz over 1 year ago
@waynechen55 I suggest using the usual ticket reporting template that you get when clicking the "report ticket" button in openQA, and keeping the "link to latest" so that we always have the most recent run of openQA jobs for the affected scenarios.
Updated by nicksinger over 1 year ago
- Status changed from Workable to Blocked
- Assignee set to nicksinger
I requested a firewall change in https://sd.suse.com/servicedesk/customer/portal/1/SD-114864
Updated by nicksinger over 1 year ago
- Status changed from Blocked to Feedback
Firewall was adjusted by infra. Re-triggered the mentioned tests to check if they can pass the upload stage:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667483 TEST=nsinger_investigation_ipmi_workers_poo125810_kermit BUILD=nsinger_investigation_ipmi_workers_poo125810_kermit _GROUP=0 WORKER_CLASS=kermit_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670636 TEST=nsinger_investigation_ipmi_workers_poo125810_gonzo BUILD=nsinger_investigation_ipmi_workers_poo125810_gonzo _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670612 TEST=nsinger_investigation_ipmi_workers_poo125810_scooter BUILD=nsinger_investigation_ipmi_workers_poo125810_scooter _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667508 TEST=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 BUILD=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 _GROUP=0 WORKER_CLASS=amd-zen3-gpu-sut1_poo125810
Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009
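The clone commands above share one shape and differ only in job URL, label and worker class. As a minimal sketch, a small Python helper can assemble such a command; the helper name `build_clone_command` is hypothetical, and only the flags and settings visible in the commands above are used.

```python
import shlex


def build_clone_command(job_url, label, worker_class):
    """Return the argv list for cloning one openQA job onto a given worker class."""
    return [
        "openqa-clone-job",
        "--skip-chained-deps",
        "--within-instance",
        job_url,
        f"TEST={label}",      # same label is used for TEST and BUILD, as above
        f"BUILD={label}",
        "_GROUP=0",
        f"WORKER_CLASS={worker_class}",
    ]


# Values copied from the first command above (kermit).
cmd = build_clone_command(
    "https://openqa.suse.de/tests/10667483",
    "nsinger_investigation_ipmi_workers_poo125810_kermit",
    "kermit_poo125810",
)
print(shlex.join(cmd))
```

Passing the worker class explicitly for each machine keeps it visible which worker a cloned job is meant to land on.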
Updated by waynechen55 over 1 year ago
nicksinger wrote:
Firewall was adjusted by infra. Re-triggered the mentioned tests to check if they can pass the upload stage:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667483 TEST=nsinger_investigation_ipmi_workers_poo125810_kermit BUILD=nsinger_investigation_ipmi_workers_poo125810_kermit _GROUP=0 WORKER_CLASS=kermit_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670636 TEST=nsinger_investigation_ipmi_workers_poo125810_gonzo BUILD=nsinger_investigation_ipmi_workers_poo125810_gonzo _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10670612 TEST=nsinger_investigation_ipmi_workers_poo125810_scooter BUILD=nsinger_investigation_ipmi_workers_poo125810_scooter _GROUP=0 WORKER_CLASS=gonzo_poo125810
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10667508 TEST=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 BUILD=nsinger_investigation_ipmi_workers_poo125810_amd-zen3-gpu-sut1 _GROUP=0 WORKER_CLASS=amd-zen3-gpu-sut1_poo125810
Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009
Job #10689006 passed the previously failing step. Two of the others failed at other places.
Created job #10689006: sle-15-SP5-Online-x86_64-Build77.1-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689006 In progress. Passed the previously failing "update_package".
Created job #10689010: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689010 Failed, but passed the previously failing "update_package".
Created job #10689011: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689011 Failed, but passed the previously failing "update_package".
Created job #10689009: sle-15-SP5-Online-x86_64-Build77.1-prj2_host_upgrade_sles12sp5_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t10689009 Failed before reaching the previously failing "update_package", so the fix cannot be confirmed for this machine. If this is a firewall issue, that might be acceptable as long as all others passed this step.
Updated by waynechen55 over 1 year ago
Is this good enough to undo https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/508 ? @nicksinger @okurz
Updated by waynechen55 over 1 year ago
Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/509 to roll back. @okurz @nicksinger
Updated by waynechen55 over 1 year ago
okurz wrote:
@waynechen55 I suggest using the usual ticket reporting template that you get when clicking the "report ticket" button in openQA, and keeping the "link to latest" so that we always have the most recent run of openQA jobs for the affected scenarios.
No problem.
Updated by waynechen55 over 1 year ago
I think all four machines have the boot_from_pxe issue again:
grenache-1:12/kermit
grenache-1:13/gonzo
grenache-1:15/scooter
grenache-1:19/amd-zen3
Example: https://openqa.suse.de/tests/10692925#step/boot_from_pxe/6
Additionally, timeouts happen very frequently during test runs, for example https://openqa.suse.de/tests/10690784#step/unified_guest_installation/10
Updated by tonyyuan over 1 year ago
The PXE server is not working. See the screenshots at these links:
https://openqa.suse.de/tests/10691284
https://openqa.suse.de/tests/10691285
The DHCP server configuration for the 4 machines seems incorrect. For example, the machines cannot get the DNS server IP from DHCP. In addition, gonzo is assigned a wrong IP address that does not match its DNS name.
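A mismatch between the address handed out by DHCP and the address behind the DNS name can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function name `find_dns_mismatches` is hypothetical, and the `leases` mapping (hostname to the IP the machine actually received) has to be collected by hand, for example from the job screenshots.

```python
import socket


def find_dns_mismatches(leases, resolve=socket.gethostbyname):
    """Return {hostname: (leased_ip, dns_ip)} for hosts whose lease disagrees with DNS."""
    mismatches = {}
    for hostname, leased_ip in leases.items():
        try:
            dns_ip = resolve(hostname)
        except OSError:
            # The name does not resolve at all; report it as a mismatch too.
            dns_ip = None
        if dns_ip != leased_ip:
            mismatches[hostname] = (leased_ip, dns_ip)
    return mismatches
```

The `resolve` parameter defaults to the system resolver and can be swapped for a lookup against a specific DNS server if the local resolver is not the one the lab uses. An empty result means every leased address agrees with DNS.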
Updated by okurz over 1 year ago
- Status changed from Feedback to In Progress
On qa-jump I looked at the output of journalctl | grep -i <mac_address_of_scooter>
and found nothing related to PXE file deliveries and no error messages. @nicksinger I suggest finding a reproducer openQA job for scooter and any other affected machine, rerunning it and monitoring what the PXE server on qa-jump is doing.
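One caveat with a plain case-insensitive grep for the MAC address is that PXE-related log lines may write the same address with dashes instead of colons. As a minimal sketch (the helper names are hypothetical, and this is a deliberately loose substring match), the log lines can be filtered in Python after stripping both separators:

```python
def normalize(text):
    """Lower-case the text and drop ':' and '-' so MAC notations compare equal."""
    return text.lower().replace(":", "").replace("-", "")


def lines_mentioning_mac(lines, mac):
    """Return the log lines that contain the MAC address in any common notation."""
    wanted = normalize(mac)
    return [line for line in lines if wanted in normalize(line)]
```

Feeding the lines of `journalctl` output into `lines_mentioning_mac` gives the same kind of result as the grep above, but also catches dash-separated or upper-case spellings of the address. Because separators are removed from the whole line, an unrelated hex string could occasionally match, so the hits still need a manual look.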
Updated by waynechen55 over 1 year ago
It seems these four machines are no longer configured in qanet-configs, which has no relevant entries for them:
kermit
gonzo
scooter
amd-zen3
So where are they?
@okurz @nicksinger
Updated by waynechen55 over 1 year ago
We need to continue our RC1 testing, so if there is no objection, please help merge this merge request: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/512
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Resolved
We discovered a bug in our PXE config generation which will hopefully be fixed by https://gitlab.suse.de/qa-sle/qa-jump-configs/-/merge_requests/3
I'm resolving this ticket now in favor of https://progress.opensuse.org/issues/119551 because the initially reported issue (machines cannot upload logs because of the firewall) has been resolved in the meantime.
Updated by waynechen55 over 1 year ago
nicksinger wrote:
We discovered a bug in our PXE config generation which will hopefully be fixed by https://gitlab.suse.de/qa-sle/qa-jump-configs/-/merge_requests/3
I'm resolving this ticket now in favor of https://progress.opensuse.org/issues/119551 because the initially reported issue (machines cannot upload logs because of the firewall) has been resolved in the meantime.
Thanks. Let's wait patiently for your result.
Updated by okurz over 1 year ago
waynechen55 wrote:
It seems these four machines are no longer configured in qanet-configs, which has no relevant entries for them:
kermit
gonzo
scooter
amd-zen3
So where are they?
@okurz @nicksinger
To answer that question as well, please see https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs , which references https://gitlab.suse.de/OPS-Service/salt/ , which includes the DHCP+DNS config for FC Basement.
Updated by tinita over 1 year ago
- Subject changed from [openqa][infra] Some SUT machines an not upload logs to worker machine size:S to [openqa][infra] Some SUT machines can not upload logs to worker machine size:S