action #157528
closed
Remove redundant ASM connections for powerPC machines size:S
Added by nicksinger 9 months ago.
Updated 3 months ago.
Category:
Regressions/Crashes
Description
Motivation
Our current hypothesis is that the PPC HMC struggles with two simultaneous connections to the same ASM. This causes the managed system to "flicker" in the web UI and constantly aborts any operation you execute. We should explore whether these connection issues can be resolved by having only a single connection between ASM<->HMC.
Machines where this happens:
Acceptance criteria
Suggestions
- Research upstream what IBM suggests. We assume it is not intended to have more than one physical network connection to the same HMC
- Create an infra ticket according to https://progress.opensuse.org/projects/qa/wiki/Tools#SUSE-IT-ticket-handling asking to remove the secondary, redundant network connection. Ideally the cable is physically removed and racktables updated, rather than the port being disabled in the switch config, so that nobody tries to "fix" a disabled switch port some months later
- Ensure that machines are still controllable over HMC after cable removal
- Ensure that racktables is up-to-date with the remaining connection
- Tags set to infra, ppc, hmc, prg2
- Category set to Regressions/Crashes
- Target version set to Ready
- Parent task set to #123800
For both soapberry and blackcurrant I went from the HMC to the ASM and removed older, temporarily disconnected HMC connection entries. In the HMC I then noted the IP, removed the system connection and re-added that one, and only that one. If that keeps the HMC connections stable we should ask IT to disconnect the secondary physical ethernet connection.
Haven't seen any flickering anymore so the software workaround seems to have helped.
- Subject changed from Remove redundant ASM connections for powerPC machines to Remove redundant ASM connections for powerPC machines size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to nicksinger
- Due date set to 2024-04-10
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Due date deleted (2024-04-10)
- Status changed from Workable to Blocked
- Assignee set to okurz
- Priority changed from Normal to Low
- Target version changed from Ready to Tools - Next
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Priority changed from Low to Normal
- Target version changed from Tools - Next to Ready
- Priority changed from Normal to Low
- Target version changed from Ready to future
- Parent task changed from #123800 to #160520
- Priority changed from Low to Normal
- Target version changed from future to Ready
In https://suse.slack.com/archives/C02CANHLANP/p1723619889118259 rfan brought up an issue in openQA tests which could be related. Within https://powerhmc1.oqa.prg2.suse.org I see a flaky status for redcurrant, switching between "Operating" and "No connection". We already had this problem before when multiple interfaces were selected for connection to the HMC. Did you or anyone else recently add a separate HMC connection to redcurrant, or run the HMC wizard to find new devices and add all interfaces, which could cause this?
- Priority changed from Normal to High
- Status changed from Workable to In Progress
- Assignee set to nicksinger
We might have faced the exact same problem again (with different symptoms) in https://progress.opensuse.org/issues/163529#note-27. I am trying to collect the specific ports which infra needs to disconnect. Not sure how we can ensure nobody plugs them in again in the future.
- Due date set to 2024-09-28
Setting due date based on mean cycle time of SUSE QE Tools
- Due date deleted (2024-09-28)
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Status changed from Workable to In Progress
- Assignee set to nicksinger
HMC shows the following connections:
nsinger@powerhmc1:~> lssyscfg -r sys -F name,ipaddr,ipaddr_secondary
grenache,10.255.255.183,10.255.255.95
redcurrant,10.255.255.190,null
haldir,10.255.255.3,null
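A minimal sketch for spotting which systems the HMC still reaches over two addresses, based on the comma-separated output above. The helper name `list_secondary` is hypothetical, and the sketch assumes the output is processed on a workstation (for example by running the command over ssh), since the HMC's restricted shell may not offer awk.

```shell
# Hypothetical helper: read "name,ipaddr,ipaddr_secondary" lines on stdin
# and print name and secondary IP for every system that has one.
list_secondary() {
  awk -F, '$3 != "" && $3 != "null" { print $1 " " $3 }'
}

# Sample input copied from the lssyscfg output in this comment.
printf '%s\n' \
  'grenache,10.255.255.183,10.255.255.95' \
  'redcurrant,10.255.255.190,null' \
  'haldir,10.255.255.3,null' | list_secondary
# prints: grenache 10.255.255.95
```

Note that this only reflects what the HMC currently reports; a cable can still be plugged in without the HMC using that interface.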
- Status changed from In Progress to Workable
My idea was to request physical removal of connections and meanwhile disable the secondary ones in the ASM. While writing a request to infra I discovered that haldir is connected to some other private network. I vaguely remember something about a secondary HMC and asked Alvaro if he remembers. My current draft for the SD ticket:
Request to remove physical network connections
Hello,
due to some connection issues with our HMC I’d kindly ask you to remove the secondary fsp connections from the following powerpc machines:
grenache - fsp2, mac: 40:F2:E9:73:5D:55 - https://racktables.suse.de/index.php?page=object&object_id=3120
redcurrant - mac: 98:BE:94:7C:2D:63 - https://racktables.suse.de/index.php?page=object&object_id=11220
- Status changed from Workable to In Progress
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
The secondary interface on haldir is a leftover from the old HMC, so I created https://sd.suse.com/servicedesk/customer/portal/1/SD-168400 to request cable removal. Until then I disabled the secondary interfaces in the ASM: under "Network Services"->"Network Configuration", enable the checkbox "Configure this interface" for the corresponding interface and select "Disabled" in the "IPv4" field. After clicking save, confirm with "Save settings" again. There is no visible feedback on whether the operation worked, so one has to go back to the "Network Configuration" page to check. Since we only ever used IPv4 to the HMC I skipped IPv6; we might consider it if we still see issues.
To enable all interfaces again, the following command can be used to conveniently access all ASMs through the HMC:
ssh ${USER}@powerhmc1.oqa.prg2.suse.org -L 4441:10.255.255.183:443 -L 4442:10.255.255.190:443 -L 4443:10.255.255.3:443
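With the forwards above in place, each ASM should be reachable in a local browser at https://localhost:4441, https://localhost:4442 and https://localhost:4443. If the list of machines changes, the command could be generated instead of typed by hand. The following is only a sketch; the helper name `asm_tunnel_cmd` and the convention of counting local ports up from 4441 are assumptions taken from the command above.

```shell
# Hypothetical helper: print an ssh command that forwards local ports
# 4441, 4442, ... to port 443 of each ASM IP passed as an argument.
asm_tunnel_cmd() {
  hmc=$1
  shift
  port=4441
  cmd="ssh ${USER:-$(id -un)}@${hmc}"
  for ip in "$@"; do
    cmd="$cmd -L ${port}:${ip}:443"
    port=$((port + 1))
  done
  printf '%s\n' "$cmd"
}

# Prints the command for the three ASM addresses listed by the HMC.
asm_tunnel_cmd powerhmc1.oqa.prg2.suse.org 10.255.255.183 10.255.255.190 10.255.255.3
```

The helper only prints the command so it can be reviewed before running it.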
I currently see nothing left to do until the SD ticket is resolved, setting to Blocked.
- Status changed from Blocked to Resolved
SD ticket resolved. All interfaces are enabled in the ASM again and I saw 169.something IPs, which correspond to "auto config" IPs and show that the interfaces indeed have no cable connected.
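A small sketch of that check, assuming the "169.something" addresses are from the IPv4 link-local range 169.254.0.0/16 that an interface typically assigns itself when it gets no address any other way. The helper name `is_link_local` and the sample addresses are hypothetical; the actual addresses seen in the ASM are not recorded in this ticket.

```shell
# Hypothetical helper: succeed if the given IPv4 address lies in
# 169.254.0.0/16 (assumed to be the "auto config" range meant above).
is_link_local() {
  case $1 in
    169.254.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative addresses only, not values read from the ASM.
for ip in 169.254.2.147 10.255.255.95; do
  if is_link_local "$ip"; then
    echo "$ip: auto-configured, likely no cable or no DHCP"
  else
    echo "$ip: regular address"
  fi
done
```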