action #123933
closed
[worker][ipmi][bmc] Some worker can not be reached via BMC
0%
Description
Observation
Some workers cannot be reached via BMC:
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.kermit.qa.suse.de -U xxxxx -P xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.gonzo.qa.suse.de -U xxxxx -P xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H sp.scooter.qa.suse.de -U xxxxx -P xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
waynechen:~ # ipmitool -I lanplus -C 3 -H amd-zen3-gpu-sut1-sp.qa.suse.de -U xxxxx -P xxxxx chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
So only 3 workers are usable for virtualization jobs.
Steps to reproduce
- ipmitool -I lanplus xxxxx chassis power status
- Unreachable BMC returns Error: Unable to establish IPMI v2 / RMCP+ session
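The reproduction steps can be repeated for all affected hosts in one go. The following is only a minimal sketch, assuming the credentials are supplied through the environment variables IPMI_USER and IPMI_PASSWORD; those variable names, the IPMITOOL override hook and the function names are hypothetical and not part of any existing tooling. The host names and the ipmitool options are the ones from the observation above.

```shell
#!/bin/sh
# BMC host names copied from the observation in this ticket.
BMC_HOSTS="sp.kermit.qa.suse.de sp.gonzo.qa.suse.de sp.scooter.qa.suse.de amd-zen3-gpu-sut1-sp.qa.suse.de"

# Hypothetical override hook: lets the probe command be replaced by a stub.
IPMITOOL="${IPMITOOL:-ipmitool}"

# check_bmc HOST
# Runs the same "chassis power status" query as in the ticket and prints
# one status line. Returns 0 if the query succeeded, 1 otherwise.
check_bmc() {
    if "$IPMITOOL" -I lanplus -C 3 -H "$1" \
        -U "${IPMI_USER:-}" -P "${IPMI_PASSWORD:-}" \
        chassis power status >/dev/null 2>&1; then
        echo "$1 reachable"
    else
        echo "$1 UNREACHABLE"
        return 1
    fi
}

# check_all_bmcs
# Probes every host in BMC_HOSTS, prints a summary line and returns
# non-zero if at least one BMC could not be queried.
check_all_bmcs() {
    failed=0
    for host in $BMC_HOSTS; do
        check_bmc "$host" || failed=$((failed + 1))
    done
    echo "$failed unreachable"
    [ "$failed" -eq 0 ]
}
```

With IPMI_USER and IPMI_PASSWORD exported, calling check_all_bmcs prints one line per host plus a summary. A failed query is only reported as UNREACHABLE. The sketch does not distinguish a network problem from wrong credentials.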
Impact
- Test runs with upcoming builds will not finish in a timely manner
- Failure rate goes up significantly
Problem
BMC down or network glitch?
Suggestion
- Check BMC or network connection
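One possible way to narrow down "BMC or network" is to test plain ICMP reachability before attempting the IPMI session. The following is a rough sketch under several assumptions: ping is the Linux iputils variant (where -W is a timeout in seconds), the BMC answers ICMP at all, and credentials come from the hypothetical environment variables IPMI_USER and IPMI_PASSWORD. PING_CMD, IPMI_CMD and triage_bmc are names made up for this illustration.

```shell
#!/bin/sh
# Hypothetical override hooks so both probes can be replaced by stubs.
PING_CMD="${PING_CMD:-ping}"
IPMI_CMD="${IPMI_CMD:-ipmitool}"

# triage_bmc HOST
# Returns 2 if the host does not answer a single ICMP echo request,
# 1 if it answers ICMP but the IPMI query fails, 0 if both succeed.
triage_bmc() {
    host=$1
    # Step 1: one echo request with a 2 second timeout (iputils semantics).
    if ! "$PING_CMD" -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host: no ICMP reply"
        return 2
    fi
    # Step 2: the same IPMI query as in the observation above.
    if ! "$IPMI_CMD" -I lanplus -C 3 -H "$host" \
        -U "${IPMI_USER:-}" -P "${IPMI_PASSWORD:-}" \
        chassis power status >/dev/null 2>&1; then
        echo "$host: ICMP ok, IPMI session failed"
        return 1
    fi
    echo "$host: ok"
}
```

For example, `triage_bmc sp.kermit.qa.suse.de` would point towards the network path or a powered-off BMC when there is no ICMP reply, and towards the BMC service or the credentials when only the IPMI session fails. Since some BMCs or firewalls drop ICMP, a missing reply is a hint rather than proof.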
Workaround
There is no workaround for this issue. The BMC has to be up and reachable.
Updated by okurz almost 2 years ago
- Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added
Updated by okurz almost 2 years ago
- Status changed from New to Blocked
- Assignee set to okurz
- Target version set to Ready
I expect the worker was moved to the new lab location and still needs to be connected to the network, see #119551
Updated by waynechen55 almost 2 years ago
okurz wrote:
I expect the worker was moved to the new lab location and still needs to be connected to the network, see #119551
Can this work be done soon? In light of the approaching PublicBeta, I am considering disabling the affected workers. Some of them are active in workerconf.sls.
So do you think I need to do this right now, or should I wait for your further feedback?
Additionally, I expect all machines will keep their current domain name.
Updated by waynechen55 almost 2 years ago
I created a merge request https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/487
Updated by okurz almost 2 years ago
waynechen55 wrote:
okurz wrote:
I expect the worker was moved to the new lab location and still needs to be connected to the network, see #119551
Can this work be done soon? In light of the approaching PublicBeta, I am considering disabling the affected workers. Some of them are active in workerconf.sls.
So do you think I need to do this right now, or should I wait for your further feedback?
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/487 was merged meanwhile. After that we enabled fozzie again with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/488 which is the one machine that is available again.
We expect the network within QE Basement to be available within the next few days or, in the worst case, some weeks.
Additionally, I expect all machines will keep their current domain name.
Machines within FC Basement will likely receive a new domain name that makes it clear where the machines are located, following a consolidated plan from SUSE-IT Eng-Infra that applies to the complete network at the FC (Frankencampus) location.
Updated by okurz almost 2 years ago
- Subject changed from [worker][ipmi][bmc] Some woker can not be reached via BMC to [worker][ipmi][bmc] Some worker can not be reached via BMC
Updated by okurz almost 2 years ago
@waynechen55 all four machines sp.kermit.qa.suse.de, sp.gonzo.qa.suse.de, sp.scooter.qa.suse.de, amd-zen3-gpu-sut1-sp.qa.suse.de are controllable over IPMI again.
The machines can also boot over PXE but get the PXE boot menu from an Eng-Infra maintained server, not qanet.
I have https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/493 prepared, but unadapted tests would fail right now due to the differing PXE environment. In #119551 we are trying to handle the PXE setup with Eng-Infra to get access to a customizable environment, but this will likely still take several weeks.
In the meantime there is an alternative that can be solved completely from the os-autoinst-distri-opensuse side, without any changes to infrastructure or backend: use the Eng-Infra supplied PXE boot menu, boot an older version of the SLES installer (either an older build or an older service pack) and conduct a remote installation of the current build from there. If that is not possible due to a kernel mismatch between the "linux" file and the remote repo content, then I suggest booting an older version of SLES and updating to the current build. You can consider doing that.
Updated by okurz almost 2 years ago
- Tags set to infra, ipmi, bmc, FC Basement, lab, PXE
Updated by waynechen55 over 1 year ago
- Status changed from Blocked to Resolved
BMC connection recovered.