action #94919
closed
All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra size:S
Added by okurz over 3 years ago.
Updated over 3 years ago.
Description
Observation
2021-06-29 The rack NUE SRV2 A8 was switched off by SUSE EngInfra, see https://chat.suse.de/channel/suse-it-ama?msg=WA9YfB7CumBiPfupu and following. The rack includes openqaworker-arm-2 and openqaworker-arm-3. openqaworker-arm-1 is now off as well.
Acceptance criteria
- AC1: All three ARM workers are up again
Suggestions
- Monitor e.g. https://chat.suse.de/channel/suse-it-ama for updates on the rack NUE SRV2 A8, which hosts openqaworker-arm-2 and openqaworker-arm-3
- For openqaworker-arm-1, check whether the machine is reachable over IPMI, and contact EngInfra if it is not
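The IPMI check suggested for openqaworker-arm-1 could be scripted roughly as follows. This is only a sketch: it assumes `ipmitool` is installed locally and that the BMC speaks IPMI over LAN (`lanplus`). The function names are made up for illustration, and the BMC hostname and credentials are not part of this ticket, so they have to be supplied by the caller.

```python
import os
import shutil
import subprocess


def build_ipmi_command(host, user):
    """Return the ipmitool argument list for a chassis power query.

    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable instead of the command line.
    """
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-E",
        "chassis", "power", "status",
    ]


def parse_power_status(output):
    """Map output such as 'Chassis Power is on' to 'on', 'off' or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def check_power(host, user, password, timeout=15):
    """Return 'on', 'off', 'unknown', or 'unreachable' if the BMC does not answer."""
    if shutil.which("ipmitool") is None:
        raise RuntimeError("ipmitool is not installed")
    # Pass the password through the environment (assumption: caller provides it).
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = subprocess.run(
            build_ipmi_command(host, user),
            capture_output=True, text=True, timeout=timeout, env=env,
        )
    except subprocess.TimeoutExpired:
        return "unreachable"
    if result.returncode != 0:
        return "unreachable"
    return parse_power_status(result.stdout)
```

A result of `"unreachable"` from `check_power(...)` would be the case in which to contact EngInfra; `"off"` would mean the BMC answers and the machine could likely be powered on remotely.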
- Subject changed from All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra to All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra size:S
- Description updated (diff)
- Status changed from New to Workable
- Related to action #94940: multiple network related problems, gitlab CI pipelines not working, workers not reachable, proxySCC not reachable added
- Status changed from Workable to In Progress
- Assignee set to dheidler
- Status changed from In Progress to Blocked
- Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
The ARM workers are still down. There is an infra ticket, but I don't see any activity there either.
Can somebody post an update (if there is any news) or push to get this fixed?
Yeah, I already asked yesterday but there was no update. I have now written in the EngInfra ticket with the following text:
"Hi, unfortunately this ticket did not receive an update yesterday. You mentioned that you would investigate yesterday. Can you please provide an update and also an estimation when we will have access to the machines openqaworker-arm-1/2/3 again as these machines are part of our critical test infrastructure."
I have also written in https://chat.suse.de/channel/suse-it-ama?msg=5f92J6mF2bWt76FSe now.
Update in https://chat.suse.de/channel/suse-it-ama?msg=KhcQFuiuiZSN4Zu5S
Evzenie Sujskaja @esujskaja all Please be kindly informed that next week the EngInfra team will be working with a low headcount due to bank holidays in CZ and vacations. Some delays are possible, especially for SRV-related tasks. The main focus will be stable operational functioning. Thank you for your understanding.
Oliver Kurz @okurz Evzenie Sujskaja good for letting us know. Thank you. Can you help us get an estimate on when machines from the racks NUE-SRV2-A7+8 can be accessible for us again? See https://infra.nue.suse.com/SelfService/Display.html?id=191475
Evzenie Sujskaja @esujskaja Oliver Kurz it's in progress now, fatma ghariani is opening port by port and following the traffic. We should be very careful - we don't know what provoked the storm. We set up storm control on the port - but it can't give 100% safety; and our network is quite vulnerable to such issues. We will do our best to make it available again today or at the beginning of next week.
Oliver Kurz @okurz today or next week, thank you
Update on the INFRA ticket from Fatma Ghariani:
All openQAworkers on that rack are back to UP.
- Status changed from Blocked to Resolved
I was able to log in to all three workers via SSH, and there are jobs running on all of them.
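A quick reachability check in the spirit of AC1 could look like the sketch below. It only tests whether the SSH port accepts TCP connections; it does not log in and says nothing about whether openQA jobs are running. The function names are illustrative, and the host names would need whatever domain suffix applies in the actual network, which is not stated in this ticket.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def workers_down(hosts, port=22, timeout=5.0):
    """Return the subset of hosts whose SSH port is not reachable."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]
```

Calling `workers_down(["openqaworker-arm-1", "openqaworker-arm-2", "openqaworker-arm-3"])` would return an empty list once all three machines accept SSH connections again.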