action #94919
closed
All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra size:S
0%
Description
Observation
2021-06-29: The rack NUE SRV2 A8 was switched off by SUSE EngInfra, see https://chat.suse.de/channel/suse-it-ama?msg=WA9YfB7CumBiPfupu and the following messages. The rack includes openqaworker-arm-2 and openqaworker-arm-3. In addition, openqaworker-arm-1 is now off as well.
Acceptance criteria
- AC1: All three ARM workers are up again
Suggestions
- Monitor e.g. https://chat.suse.de/channel/suse-it-ama for updates on the rack NUE SRV2 A8 regarding openqaworker-arm-2 and openqaworker-arm-3
- For openqaworker-arm-1, check whether the machine is reachable over IPMI, or contact EngInfra if it is not
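The reachability check suggested above could be scripted roughly as follows. This is only a minimal sketch: the IPMI hostnames are hypothetical placeholders (the ticket does not state the real BMC addresses), and it assumes a Linux `ping` that accepts `-c` and `-W`. A successful ping only shows that the IPMI interface answers on the network, not that the machine itself is powered on.

```python
import subprocess

# Hypothetical IPMI hostnames, for illustration only; replace them with the
# real BMC addresses of the three ARM workers.
IPMI_HOSTS = [
    "openqaworker-arm-1-ipmi.example.org",
    "openqaworker-arm-2-ipmi.example.org",
    "openqaworker-arm-3-ipmi.example.org",
]


def is_pingable(host, runner=subprocess.run):
    """Return True if a single ICMP echo request to `host` is answered."""
    result = runner(
        # Assumes Linux iputils ping: one packet, two second reply timeout.
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(hosts, check=is_pingable):
    """Return the hosts for which the reachability check fails."""
    return [host for host in hosts if not check(host)]
```

Calling `unreachable_hosts(IPMI_HOSTS)` returns the list of IPMI interfaces that do not answer. If that list is not empty, the next step from the suggestion applies: contact EngInfra.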
Updated by okurz over 3 years ago
- Subject changed from All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra to All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 3 years ago
- Related to action #94940: multiple network related problems, gitlab CI pipelines not working, workers not reachable, proxySCC not reachable added
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Updated by dheidler over 3 years ago
- Status changed from In Progress to Blocked
We tried accessing the IPMI console but for all three workers we can't even ping the IPMI host.
Created a ticket with infra: https://infra.nue.suse.com/SelfService/Display.html?id=191475
Updated by okurz over 3 years ago
- Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
Updated by dzedro over 3 years ago
The ARM workers are still down. There is an infra ticket, but I don't see any activity there either.
Can somebody provide an update (if there is any news) or push to get this fixed?
Updated by okurz over 3 years ago
Yeah, I already asked yesterday but there was no update. I have now written in the EngInfra ticket with the following text:
"Hi, unfortunately this ticket did not receive an update yesterday. You mentioned that you would investigate yesterday. Can you please provide an update and also an estimation when we will have access to the machines openqaworker-arm-1/2/3 again as these machines are part of our critical test infrastructure."
I have also written in https://chat.suse.de/channel/suse-it-ama?msg=5f92J6mF2bWt76FSe now.
Updated by okurz over 3 years ago
Update in https://chat.suse.de/channel/suse-it-ama?msg=KhcQFuiuiZSN4Zu5S
Evzenie Sujskaja @esujskaja all Please be kindly informed, that next week EngInfra team will be working in a low headcount due to bank holidays in CZ and vacations. Some delays are possible, especially for SRV-related tasks. Main focus will operational stable functioning. Thank you for your understanding..
Oliver Kurz @okurz Evzenie Sujskaja good for letting us know. Thank you. Can you help us get an estimate on when machines from the racks NUE-SRV2-A7+8 can be accessible for us again? See https://infra.nue.suse.com/SelfService/Display.html?id=191475
Evzenie Sujskaja @esujskaja Oliver Kurz it's in progress now, fatma ghariani is opening port by port and following the traffic. We should be very careful - we don't know what provoked the storm. We set up storm control on the port - but it can't give 100% safety; and our network is quite vulnerable for such issues. We will do our best to make it back available today or beg next week.
Oliver Kurz @okurz today or next week, thank you
Updated by okurz over 3 years ago
Provided an update in https://infra.nue.suse.com/SelfService/Display.html?id=191475#txn-2929933 . All three machines are reachable over IPMI but have no network link.
Updated by dheidler over 3 years ago
Update on the INFRA ticket from Fatma Ghariani:
All openQAworkers on that rack are back to UP.
Updated by dheidler over 3 years ago
- Status changed from Blocked to Resolved
I was able to log in to all three workers via SSH and there are jobs running on all of them.