action #157528
closed
Remove redundant ASM connections for powerPC machines size:S
Description
Motivation
Our current hypothesis is that the PPC HMC struggles with two simultaneous connections to the ASM. This causes the managed system to "flicker" in the webui and constantly aborts any operation you execute. We should explore whether these connection issues can be resolved by having only a single connection between ASM<->HMC.
Machines where this happens:
- soapberry
- blackcurrant
Acceptance criteria
- AC1: https://powerhmc1.oqa.prg2.suse.org/ no longer shows machines flickering between "No connection" and "Operating"
- AC2: racktables is up-to-date
Suggestions
- Research upstream what IBM suggests. We assume that connecting more than one physical network connection to the same HMC is not an anticipated setup
- Create an infra ticket according to https://progress.opensuse.org/projects/qa/wiki/Tools#SUSE-IT-ticket-handling asking to remove the secondary, redundant network connection. Ideally the cable is physically removed and racktables updated, rather than the port being disabled in the switch config, so that nobody tries to "fix" a disabled switch port some months later
- Ensure that machines are still controllable over HMC after cable removal
- Ensure that racktables is up-to-date with the remaining connection
Updated by okurz 9 months ago
- Tags set to infra, ppc, hmc, prg2
- Category set to Regressions/Crashes
- Target version set to Ready
- Parent task set to #123800
For both soapberry and blackcurrant I went to the ASM and removed the older, temporarily disconnected HMC connection entries. In the HMC I noted the IP, removed the system connection and re-added that one, but only that one. If that keeps the HMC connections stable, we should ask IT to disconnect the secondary physical Ethernet connection.
Updated by nicksinger 9 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by openqa_review 9 months ago
- Due date set to 2024-04-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 9 months ago
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
Unassigning due to sick leave. The connections are already removed in software; what is still missing is an SD ticket requesting that the cables be physically unplugged.
Also check https://bugzilla.suse.com/show_bug.cgi?id=1221485
Updated by okurz 9 months ago
- Due date deleted (2024-04-10)
- Status changed from Workable to Blocked
- Assignee set to okurz
- Priority changed from Normal to Low
- Target version changed from Ready to Tools - Next
Good point. I am also following https://bugzilla.suse.com/show_bug.cgi?id=1221485 and I think we can wait for progress there.
Updated by okurz 9 months ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Priority changed from Low to Normal
- Target version changed from Tools - Next to Ready
https://bugzilla.suse.com/show_bug.cgi?id=1221485#c14 shows that they actually suffered from the same problem: the same network was connected to both physical HMC Ethernet ports, which should be avoided. So back to the previous state.
Updated by okurz 4 months ago
- Priority changed from Low to Normal
- Target version changed from future to Ready
In https://suse.slack.com/archives/C02CANHLANP/p1723619889118259 rfan brought up an issue in openQA tests which could be related. I observed that within https://powerhmc1.oqa.prg2.suse.org I see a flaky status for redcurrant, switching between "Operating" and "No connection". We already had this problem before, when multiple interfaces were selected for the connection to the HMC. Did you or anyone else recently add a separate HMC connection to redcurrant, or run the HMC wizard to find new devices and add all interfaces, which could cause this?
Updated by nicksinger 3 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
We might have faced the exact same problem again (with different symptoms) in https://progress.opensuse.org/issues/163529#note-27. I will try to collect the specific ports which infra needs to disconnect. I am not sure how we can ensure that nobody plugs them in again in the future.
Updated by openqa_review 3 months ago
- Due date set to 2024-09-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 3 months ago
- Due date deleted (2024-09-28)
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
Updated by nicksinger 3 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
HMC shows the following connections:
nsinger@powerhmc1:~> lssyscfg -r sys -F name,ipaddr,ipaddr_secondary
grenache,10.255.255.183,10.255.255.95
redcurrant,10.255.255.190,null
haldir,10.255.255.3,null
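To spot systems that still have a second HMC connection configured, the comma-separated output above can be filtered. This is only a minimal sketch: the helper name `list_secondary_connections` is hypothetical, and it assumes the three-field `name,ipaddr,ipaddr_secondary` format shown above, with `null` meaning no secondary address.

```shell
# Hypothetical helper: reads "name,ipaddr,ipaddr_secondary" lines on stdin
# and prints "<name> <secondary ip>" for every system whose third field
# is neither empty nor the literal string "null".
list_secondary_connections() {
  awk -F, 'NF >= 3 && $3 != "null" && $3 != "" { print $1 " " $3 }'
}

# Sample input copied from the lssyscfg output above.
printf '%s\n' \
  'grenache,10.255.255.183,10.255.255.95' \
  'redcurrant,10.255.255.190,null' \
  'haldir,10.255.255.3,null' | list_secondary_connections
```

On the HMC itself the same filter could presumably be fed directly from `lssyscfg -r sys -F name,ipaddr,ipaddr_secondary`, provided `awk` is available in that restricted shell, which is not verified here.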
Updated by nicksinger 3 months ago
- Status changed from In Progress to Workable
We can either physically remove the cable or disable the interface in the ASM. I asked for opinions in: https://suse.slack.com/archives/C02AJ1E568M/p1726587006772399
Updated by nicksinger 3 months ago
My idea was to request physical removal of connections and meanwhile disable the secondary ones in the ASM. While writing a request to infra I discovered that haldir is connected to some other private network. I vaguely remember something about a secondary HMC and asked Alvaro if he remembers. My current draft for the SD ticket:
Request to remove physical network connections
Hello,
due to some connection issues with our HMC I’d kindly ask you to remove the secondary fsp connections from the following powerpc machines:
grenache - fsp2, mac: 40:F2:E9:73:5D:55 - https://racktables.suse.de/index.php?page=object&object_id=3120
redcurrant - mac: 98:BE:94:7C:2D:63 - https://racktables.suse.de/index.php?page=object&object_id=11220
Updated by nicksinger 3 months ago
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
The secondary interface on haldir is something left over from the old HMC, so I created https://sd.suse.com/servicedesk/customer/portal/1/SD-168400 to request cable removal. Until then I disabled the secondary interfaces in the ASM: under "Network Services" -> "Network Configuration", enable the checkbox "Configure this interface" for the corresponding interface and select "Disabled" for the "IPv4" field. After clicking save, confirm with "Save settings" again. There is no visible feedback on whether the operation worked, so one has to go back to the "Network Configuration" page to check. Since we have only ever used IPv4 to the HMC I skipped IPv6; we might consider it if we still see issues.
To enable all interfaces again the following can be used to conveniently access all ASMs:
ssh ${USER}@powerhmc1.oqa.prg2.suse.org -L 4441:10.255.255.183:443 -L 4442:10.255.255.190:443 -L 4443:10.255.255.3:443
I currently see nothing left to do until the SD ticket is resolved, setting to Blocked.
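If more ASMs need to be reached, the list of `-L` forwards in the command above can be generated instead of typed by hand. The sketch below is only illustrative: `asm_forward_args` is a hypothetical helper, and the starting local port 4441 and remote port 443 are simply taken from that command.

```shell
# Hypothetical helper: prints one "-L <local port>:<ip>:443" forward per ASM
# address given as an argument, counting local ports up from 4441.
# Note: port, sep and ip are plain (global) shell variables to stay POSIX-compatible.
asm_forward_args() {
  port=4441
  sep=''
  for ip in "$@"; do
    printf '%s%s %s:%s:443' "$sep" '-L' "$port" "$ip"
    sep=' '
    port=$((port + 1))
  done
  printf '\n'
}

# Prints the forwards for the three ASM addresses listed by lssyscfg.
asm_forward_args 10.255.255.183 10.255.255.190 10.255.255.3

# Possible usage (left unquoted on purpose so the arguments are word-split):
#   ssh "${USER}@powerhmc1.oqa.prg2.suse.org" $(asm_forward_args 10.255.255.183 10.255.255.190 10.255.255.3)
```

With the tunnel open, each ASM should then be reachable in a browser under https://localhost:4441, https://localhost:4442 and so on, in the order the addresses were given.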
Updated by nicksinger 3 months ago
- Status changed from Blocked to Resolved
The SD ticket is resolved. I enabled all interfaces in the ASM again and saw 169.something IPs, which correspond to "auto config" IPs and show that the interfaces indeed have no cable connected.
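The "169.something" addresses are presumably IPv4 link-local addresses from 169.254.0.0/16 (RFC 3927), which an interface assigns itself when it gets no other configuration. A rough way to classify an address read from the ASM page could look like the following; `is_link_local` is a hypothetical helper and 169.254.10.20 is a made-up sample value, not one observed on these machines.

```shell
# Hypothetical helper: succeeds if the given IPv4 address starts with
# "169.254.", i.e. lies in the link-local range 169.254.0.0/16.
# This is a plain prefix check, not a full validation of the address.
is_link_local() {
  case "$1" in
    169.254.*) return 0 ;;
    *) return 1 ;;
  esac
}

# 169.254.10.20 is an invented sample; 10.255.255.183 is grenache's primary ASM address.
for ip in 169.254.10.20 10.255.255.183; do
  if is_link_local "$ip"; then
    echo "$ip: link-local (self-assigned, likely no cable connected)"
  else
    echo "$ip: regular address"
  fi
done
```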