action #126188
closed [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that leads to tangible test run failure size:M
Description
Observation
It seems that the current openQA infra performance is still not good enough to run openQA test suites smoothly, because there are still lots of test runs failing due to environment issues, for example:
grenache-1:13/gonzo failed at host_upgrade_generate_run_file. Can not resolve host grenache-1.qa.suse.de
grenache-1:10/openqaipmi5 failed at boot_from_pxe Error connecting to root@openqaipmi5.qa.suse.de
grenache-1:12/kermit failed at boot_from_pxe Can not find kernel image
grenache-1:19/amd-zen3-gpu-sut1-1 failed at host_upgrade_generate_run_file. Can not resolve host grenache-1.qa.suse.de
grenache-1:10/openqaipmi5 failed at boot_from_pxe Command timed out
grenache-1:15/scooter failed at update_package Can not resolve host grenache-1.qa.suse.de
By the way, I do not link an openQA report directly because many different cases are involved here.
Steps to reproduce
- Run virtualization openQA test suites
Impact
- Virtualization openQA test runs cannot complete (pass or fail) smoothly and in a timely manner.
Problem
- Generally speaking, this looks like an infra or environment issue.
Suggestion
- Check network segmentation
- Check firewalls and routes traversed
- Check infra performance
- Check infra configuration
Workaround
n/a
Updated by waynechen55 almost 2 years ago
- Subject changed from [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that leads to test run failure tangible to [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that leads to tangible test run failure
Updated by mkittler almost 2 years ago
We have two different issues here (which you normally shouldn't mix):
- Tests using the IPMI backend are failing very soon: https://openqa.suse.de/tests/10728059#step/boot_from_pxe/11
- Not sure yet why that is.
- Tests failing to upload logs later on: https://openqa.suse.de/tests/10728055#step/update_package/8
- Previous uploads work. It looks like the SUT might not be ready at the point the upload is attempted (although the SSH connection to it could be established).
Updated by livdywan almost 2 years ago
- Subject changed from [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that leads to tangible test run failure to [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that leads to tangible test run failure size:M
- Status changed from New to Feedback
- Assignee set to mkittler
- Target version set to Ready
Updated by mkittler almost 2 years ago
Regarding openqaipmi5.qa.suse.de: It looks like https://openqa.suse.de/tests/10731111 is running now. Also previous jobs that have already finished seemed to get past the setup like https://openqa.suse.de/tests/10730149. That job then ended up running into the second issue (uploading doesn't work after reboot). Not sure whether we need to do anything regarding the setup issue considering it seems to work again.
And the uploading issue really looks like a network issue within the SUT itself, especially since the test actually was able to upload some other logs.
Updated by okurz over 1 year ago
One important point to mention is that openqaipmi5 is located in NUE1-SRV2, so it is not just a simple problem related to the move to the FC Basement.
Updated by mkittler over 1 year ago
I say it is not about our setup because:
- Other uploads as part of that job worked. So the firewall is generally not blocking traffic.
- It happens shortly after rebooting the SUT indicating hostname resolution is just not ready at this point.
So can you check the state of the SUT, e.g. using the developer mode?
Updated by MDoucha over 1 year ago
My first guess would be that one of the switches in the server room is running in hub mode. Most likely the one that IPMI SUTs are plugged into. That'll cause serious performance issues on busy networks.
Updated by waynechen55 over 1 year ago
mkittler wrote:
I say it is not about our setup because:
- Other uploads as part of that job worked. So the firewall is generally not blocking traffic.
- It happens shortly after rebooting the SUT indicating hostname resolution is just not ready at this point.
So can you check the state of the SUT, e.g. using the developer mode?
What can developer mode do for this case? I cannot figure it out.
And which case supports this judgement?
It happens shortly after rebooting the SUT indicating hostname resolution is just not ready at this point.
I have been keeping an eye on test runs but still cannot find a clue.
Updated by mkittler over 1 year ago
What can developer mode do for this case? I cannot figure it out.
You could make the test pause at host_upgrade_generate_run_file and investigate the networking problems on the SUT manually.
And which case supports this judgement?
If it were a firewall issue or the command server of os-autoinst were broken, it would likely also affect other uploads. However, other uploads (e.g. in the previous test module update_package) work fine (e.g. https://openqa.suse.de/tests/10730149#step/update_package/63). So, since a general problem with the setup is unlikely, this speaks for some issue within the SUT (after rebooting it, considering it is rebooted just in the test module reboot_and_wait_up_normal before the failure). Maybe it is the simple case of hostname resolution not being ready yet, or there's some other misconfiguration. (This is mainly about the case https://openqa.suse.de/tests/10730149#step/host_upgrade_generate_run_file/4. I've seen other types of failures as well. They also look like something's broken on the test side, though.)
Updated by waynechen55 over 1 year ago
mkittler wrote:
What can developer mode do for this case? I cannot figure it out.
You could make the test pause at host_upgrade_generate_run_file and investigate the networking problems on the SUT manually.
And which case supports this judgement?
If it were a firewall issue or the command server of os-autoinst were broken, it would likely also affect other uploads. However, other uploads (e.g. in the previous test module update_package) work fine (e.g. https://openqa.suse.de/tests/10730149#step/update_package/63). So, since a general problem with the setup is unlikely, this speaks for some issue within the SUT (after rebooting it, considering it is rebooted just in the test module reboot_and_wait_up_normal before the failure). Maybe it is the simple case of hostname resolution not being ready yet, or there's some other misconfiguration. (This is mainly about the case https://openqa.suse.de/tests/10730149#step/host_upgrade_generate_run_file/4. I've seen other types of failures as well. They also look like something's broken on the test side, though.)
- I will do as you instructed.
- I think we'd better use type_command => 1 with the script_output API in our tests, so the command is not downloaded from the worker machine; that should reduce failures (see the sketch below).
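A rough sketch of that idea (assuming a typical os-autoinst test module; the command and module structure are only illustrative, not the actual test code):
use strict;
use warnings;
use testapi;

sub run {
    # With type_command => 1 the command is typed into the console directly
    # instead of being downloaded as a script from the worker, so it does not
    # depend on the SUT being able to resolve the worker's hostname.
    my $output = script_output('hostnamectl status', 120, type_command => 1);
    record_info('host status', $output);
}

1;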
Updated by waynechen55 over 1 year ago
And I cannot establish a SOL session to some machines at the moment:
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.gonzo.qa.suse.de xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.kermit.qa.suse.de xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H amd-zen3-gpu-sut1-sp.qa.suse.de xxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.scooter.qa.suse.de xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
Updated by mgriessmeier over 1 year ago
waynechen55 wrote:
And I cannot establish a SOL session to some machines at the moment:
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.gonzo.qa.suse.de xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.kermit.qa.suse.de xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H amd-zen3-gpu-sut1-sp.qa.suse.de xxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.scooter.qa.suse.de xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
There have been reports of general network issues at the moment in the Frankencampus office (that's also where those machines are located).
I cannot ping any host in this network; it says "Destination unreachable", which supports my assumption that something is wrong with the whole network.
Updated by mgriessmeier over 1 year ago
Works again (for all but amd-zen3...):
matthi@paramore:~(:|✔) # for i in gonzo kermit scooter; do ipmitool -I lanplus -C 3 -H sp.$i.qa.suse.de -U xxx -P xxx chassis power status; done
Chassis Power is on
Chassis Power is on
Chassis Power is on
matthi@paramore:~(:|✔) # ipmitool -I lanplus -C 3 -H amd-zen3-gpu-sut1-sp.qa.suse.de -U xxx -P xxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
Updated by waynechen55 over 1 year ago
I did some investigation on three machines (scooter, ix64ph1087 and amd-zen3) with the same test suite. You can refer to https://openqa.suse.de/tests/10797074#step/host_upgrade_step2_run/14.
- The "Can not resolve host grenache-1.qa.suse.de" issue still happens to scooter. For the error you can refer to https://openqa.suse.de/tests/10797074#step/host_upgrade_step2_run/14.
- It passed the host_upgrade_generate_run_file step, because it uses type_command => 1 there. But it still failed in the next step host_upgrade_step2_run, which needs to upload a log again, so the test run failed and scooter was powered off.
- In the steps host_upgrade_generate_run_file and host_upgrade_step2_run, I also checked manually and I cannot ping grenache-1 and openqa.suse.de from inside scooter.
- Next I powered scooter on again and logged in to it manually. I still cannot ping grenache-1 and openqa.suse.de from inside scooter.
- At last, I found its /etc/resolv.conf is empty, as below:
### /etc/resolv.conf file autogenerated by netconfig!
#
# Before you change this file manually, consider to define the
# static DNS configuration using the following variables in the
# /etc/sysconfig/network/config file:
# NETCONFIG_DNS_STATIC_SEARCHLIST
# NETCONFIG_DNS_STATIC_SERVERS
# NETCONFIG_DNS_FORWARDER
# or disable DNS configuration updates via netconfig by setting:
# NETCONFIG_DNS_POLICY=''
#
# See also the netconfig(8) manual page and other documentation.
#
# Note: Manual change of this file disables netconfig too, but
# may get lost when this file contains comments or empty lines
# only, the netconfig settings are same with settings in this
# file and in case of a "netconfig update -f" call.
#
### Please remove (at least) this line when you modify the file!
search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz
- After adding some nameserver entries to this file, I can ping grenache-1 and openqa.suse.de successfully.
- So I think /etc/resolv.conf was emptied after the reboot and cannot be populated successfully again.
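A small diagnostic sketch that could be added to the test right after the reboot so such failures are easier to triage from the job logs (assuming the os-autoinst testapi; the subroutine name is made up for illustration):
use testapi;

sub record_resolv_conf {
    # type_command avoids downloading the script from the worker, which would
    # itself require working name resolution; proceed_on_failure keeps the
    # test going so the actual failure is still reported at the usual place.
    my $resolv = script_output('cat /etc/resolv.conf', 60, type_command => 1, proceed_on_failure => 1);
    record_info('resolv.conf', $resolv);
    record_soft_failure('poo#126188: no nameserver entry in /etc/resolv.conf after reboot')
        unless $resolv =~ /^nameserver\s+\S+/m;
}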
Updated by cachen over 1 year ago
- Next I powered scooter on again and logged in to it manually. I still cannot ping grenache-1 and openqa.suse.de from inside scooter.
- At last, I found its /etc/resolv.conf is empty, as below:
### /etc/resolv.conf file autogenerated by netconfig! [...] search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz
- After adding some nameserver entries to this file, I can ping grenache-1 and openqa.suse.de successfully.
- So I think /etc/resolv.conf was emptied after the reboot and cannot be populated successfully again.
Then it looks like sometimes the SUT cannot get the correct DNS setup from the DHCP server? Can we check the DHCP server to see if anything blocks it? Thanks a lot!
Updated by mkittler over 1 year ago
- Status changed from Feedback to In Progress
So I think /etc/resolv.conf was emptied after the reboot and cannot be populated successfully again.
This is exactly the kind of network misconfiguration within the SUT I was getting at.
Then it looks like sometimes the SUT cannot get the correct DNS setup from the DHCP server?
Either the DHCP/DNS setup in the SUT is misconfigured or it is the DHCP server, indeed. I would suspect the former considering we don't have general issues with DHCP.
What is the VLAN of scooter? The racktables page (https://racktables.suse.de/index.php?object_id=10124&page=object&tab=default) lacks this information. Considering other hosts in the same rack are in VLAN 12 I suspect that it is VLAN 12. That would mean the relevant DHCP server is hosted on qanet.qa.suse.de. The IP of scooter-1.qa.suse.de itself resolves to 10.168.192.87 at this point. I've checked logs on qanet via journalctl | grep -i 10.168.192.87 but couldn't find anything. Maybe that DHCP server is not used after all. I would do some further digging on the SUT/scooter itself. Does it get a network link at all? Is it even configured to make DHCP requests? If so, do you see any problems from the SUT's side?
It looks like scooter-1 is even online and accessible via SSH so I'll have a look to answer those questions.
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
scooter is currently executing this job: https://openqa.suse.de/tests/10818998
It is at module sriov_network_card_pci_passthrough, which comes after reboot_after_installation. The networking setup is currently working just fine. This test was also able to upload logs to grenache-1 after the reboot. So there is not really much to investigate. This means that our DHCP server works in general (unless this test run applies some workaround).
Updated by okurz over 1 year ago
- Parent task set to #115502
mkittler wrote:
What is the VLAN of scooter? The racktables page (https://racktables.suse.de/index.php?object_id=10124&page=object&tab=default) lacks this information. Considering other hosts in the same rack are in VLAN 12 I suspect that it is VLAN 12. That would mean the relevant DHCP server is hosted on qanet.qa.suse.de. The IP of scooter-1.qa.suse.de itself resolves to 10.168.192.87 at this point. I've checked logs on qanet via journalctl | grep -i 10.168.192.87 but couldn't find anything. Maybe that DHCP server is not used after all. I would do some further digging on the SUT/scooter itself.
That cannot be correct. VLAN 12 is a VLAN that is only used within NUE1, i.e. Maxtorhof. The rack with the machine is part of the FC Basement. Do you remember when we did the racktables walkthrough over machines and networks? We should revisit, update and complete the information if entries state a wrong or missing VLAN. See https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs#Current_management_of_FC_Basement_lab_network_config for information regarding the network of the FC Basement.
EDIT: Completing the information regarding IP and network is part of #124637
Does it get a network link at all? Is it even configured to make DHCP requests? If so, do you see any problems from the SUT's side?
It looks like scooter-1 is even online and accessible via SSH so I'll have a look to answer those questions.
Updated by okurz over 1 year ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category set to Support
- Priority changed from Urgent to High
@waynechen55 @cachen We discussed this in our daily infra call 2023-03-30. As far as we can see, the DHCP server within the FC Basement works as expected. We do not currently have administrative access to the DHCP server. This is planned to be worked on in #125450 and the corresponding SD tickets linked in there (internal reference for myself: specifically https://sd.suse.com/servicedesk/customer/portal/1/SD-113959). As explained by mkittler e.g. in #126188#note-10, we assume the problem is within the SUT and the test design, where the SUT might not have consistent access to the network yet immediately after bootup.
We suggest you look into that from test perspective and either debug what is visible in system journals of the bootup process or adjust the test code to ensure the test execution only continues after the network initialization is finished.
With that I am reducing prio to High as we feel we did what we could provide from our side so far. Keeping mkittler assigned to support you in the investigation process and answer questions as needed if any.
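A rough sketch of what that test-side adjustment could look like (assuming the script_retry helper from the test distribution's utils.pm; the probed hostname and the timings are only illustrative):
use testapi;
use utils 'script_retry';

sub wait_for_network_ready {
    # Keep probing name resolution of the openQA/worker host until it succeeds,
    # instead of continuing immediately after the SUT has rebooted.
    script_retry('getent hosts openqa.suse.de', delay => 15, retry => 20);
}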
Updated by cachen over 1 year ago
okurz wrote:
@waynechen55 @cachen We discussed this in our daily infra call 2023-03-30. As far as we can see, the DHCP server within the FC Basement works as expected. We do not currently have administrative access to the DHCP server. This is planned to be worked on in #125450 and the corresponding SD tickets linked in there (internal reference for myself: specifically https://sd.suse.com/servicedesk/customer/portal/1/SD-113959). As explained by mkittler e.g. in #126188#note-10, we assume the problem is within the SUT and the test design, where the SUT might not have consistent access to the network yet immediately after bootup.
We suggest you look into that from test perspective and either debug what is visible in system journals of the bootup process or adjust the test code to ensure the test execution only continues after the network initialization is finished.
With that I am reducing prio to High as we feel we did what we could provide from our side so far. Keeping mkittler assigned to support you in the investigation process and answer questions as needed if any.
Understood the challenge if you don't have access permission to the DHCP server. Thank you, @okurz, @mkittler and @nicksinger for taking care of those machines' infra and network issues.
I tracked some test jobs; the disconnection between the SUT and the worker grenache-1.qa.suse.de does not reproduce every time. Before the tools team can dig into the DHCP/DNS settings, just like Oliver suggested, @waynechen55 @xlai let's see if any workaround can be placed in the test, e.g. add a step to check the network setup and correct the nameserver if it is detected to be missing.
Again, thank you all for working together on these challenging tickets, much appreciated!
Updated by waynechen55 over 1 year ago
cachen wrote:
okurz wrote:
@waynechen55 @cachen We discussed this in our daily infra call 2023-03-30. As far as we can see, the DHCP server within the FC Basement works as expected. We do not currently have administrative access to the DHCP server. This is planned to be worked on in #125450 and the corresponding SD tickets linked in there (internal reference for myself: specifically https://sd.suse.com/servicedesk/customer/portal/1/SD-113959). As explained by mkittler e.g. in #126188#note-10, we assume the problem is within the SUT and the test design, where the SUT might not have consistent access to the network yet immediately after bootup.
We suggest you look into that from test perspective and either debug what is visible in system journals of the bootup process or adjust the test code to ensure the test execution only continues after the network initialization is finished.
With that I am reducing prio to High as we feel we did what we could provide from our side so far. Keeping mkittler assigned to support you in the investigation process and answer questions as needed if any.
Understood the challenge if you don't have access permission to the DHCP server. Thank you, @okurz, @mkittler and @nicksinger for taking care of those machines' infra and network issues.
I tracked some test jobs; the disconnection between the SUT and the worker grenache-1.qa.suse.de does not reproduce every time. Before the tools team can dig into the DHCP/DNS settings, just like Oliver suggested, @waynechen55 @xlai let's see if any workaround can be placed in the test, e.g. add a step to check the network setup and correct the nameserver if it is detected to be missing. Again, thank you all for working together on these challenging tickets, much appreciated!
- I was considering this yesterday; maybe "netconfig update -f" will do the trick. But this only works if the failure is not caused by a network breakdown and is only caused by a temporary DHCP glitch. If there is a severe communication issue, nothing will help. (See the sketch after this list.)
- I think the issue is really worth further investigation on the infra side. Although it cannot be reproduced every time, it happens frequently. The latest Build 88.1 test run hit the issue again; please refer to example1 and example2.
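A rough sketch of that workaround (assuming the os-autoinst testapi; the probed hostname and the helper name repair_dns are made up for illustration):
use testapi;

sub repair_dns {
    # script_run types the command directly on the SUT, so it works even when
    # the SUT cannot resolve the worker's hostname.
    my $rc = script_run('getent hosts openqa.suse.de');
    return if defined($rc) && $rc == 0;    # name resolution already works
    record_soft_failure('poo#126188: DNS broken after reboot, re-running netconfig');
    assert_script_run('netconfig update -f');
    assert_script_run('getent hosts openqa.suse.de');
}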
Updated by okurz over 1 year ago
waynechen55 wrote:
- I was considering this yesterday; maybe "netconfig update -f" will do the trick. But this only works if the failure is not caused by a network breakdown and is only caused by a temporary DHCP glitch. If there is a severe communication issue, nothing will help.
I still doubt there is a breakdown on the DHCP server side. It might behave differently and hence make such issues more apparent, but I still assume the problem is that tests try to continue too fast while the SUT is not yet ready.
- I think the issue is really worth further investigation on the infra side. Although it cannot be reproduced every time, it happens frequently.
We would look into this issue as soon as Eng-Infra gives us access. Based on previous experience I assume this will take more months (!). If the issue is happening frequently, then I strongly recommend you follow up from the test side.
Updated by mkittler over 1 year ago
Considering what we've already found out, the new ticket #127256 might be the same. It is very explicit about what's not working (nameserver missing from the DHCP response), and the test failures reported here have a matching symptom. Unfortunately, this doesn't change what @okurz wrote in the last paragraph of the previous comment.
Updated by mkittler over 1 year ago
- Related to action #127256: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M added
Updated by okurz over 1 year ago
- Related to action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M added
Updated by openqa_review over 1 year ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: virt-guest-migration-developing-from-developing-to-developing-kvm-dst@virt-mm-64bit-ipmi
https://openqa.suse.de/tests/10933402#step/guest_migration_dst/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by livdywan over 1 year ago
Related branch in progress: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16774
Updated by okurz over 1 year ago
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456 was merged and is deployed to both our DHCP servers, walter1.qe.nue2.suse.org and walter2.qe.nue2.suse.org. We assume this fixes the problem.
Updated by mkittler over 1 year ago
It would be nice if you could confirm that it works now.
Updated by okurz over 1 year ago
- Status changed from Feedback to In Progress
@mkittler please check the history of openQA jobs. If you find any obvious errors of the same kind, please note that and work on it; otherwise resolve the ticket.
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
- Status changed from In Progress to Resolved
I've checked the scenarios from the ticket description. Some tests pass now. I haven't seen any obvious failures due to name resolution in the recent history. Since I haven't seen the problem on openqaworker1 anymore either, I suppose it can be considered resolved.
Updated by xlai over 1 year ago
mkittler wrote:
I've checked the scenarios from the ticket description. Some tests pass now. I haven't seen any obvious failures due to name resolution in the recent history. Since I haven't seen the problem on openqaworker1 anymore either, I suppose it can be considered resolved.
Thanks for the fix! We will also follow the results. If it is reproduced again, we will share info here and reopen the ticket.